HS Problems

HS 1 (In danish)

1.

We run the requested commands

library(MASS)
head(cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6
dim(cats)
## [1] 144   3
summary(cats)
##  Sex         Bwt             Hwt       
##  F:47   Min.   :2.000   Min.   : 6.30  
##  M:97   1st Qu.:2.300   1st Qu.: 8.95  
##         Median :2.700   Median :10.10  
##         Mean   :2.724   Mean   :10.63  
##         3rd Qu.:3.025   3rd Qu.:12.12  
##         Max.   :3.900   Max.   :20.50

We can also import cats as a dataset in R in its own right, rather than accessing it as part of the MASS package, by declaring;

cats <- cats

2.

Note that in the cats dataset we are dealing with different units. Specifically, Bwt is in kilograms while Hwt is in grams. To compute the ratio of Hwt to Bwt, and from it a percentage, we must therefore have them in the same unit. In this case we work in kg, so we multiply Hwt by \(1=\frac{1\,kg}{1000\,g},\) then take the ratio Hwt/Bwt, and finally multiply by \(100\) to obtain the result as a percentage. The resulting transformation, followed by a fresh look at head(cats):

cats <- transform(cats, pct = (cats$Hwt/1000)*(1/cats$Bwt)*100)
head(cats)
##   Sex Bwt Hwt       pct
## 1   F 2.0 7.0 0.3500000
## 2   F 2.0 7.4 0.3700000
## 3   F 2.0 9.5 0.4750000
## 4   F 2.1 7.2 0.3428571
## 5   F 2.1 7.3 0.3476190
## 6   F 2.1 7.6 0.3619048

Thus, for example, row two tells us that the heart weight of cat number two constitutes a mere \(0.37\%\) of the cat’s body weight - surprisingly little.
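As a sanity check, the figure for cat number two can be recomputed by hand; a minimal sketch using the values from the output above:

```r
# Heart weight 7.4 g is converted to kg, divided by body weight 2.0 kg,
# and multiplied by 100 to give a percentage.
hwt_g <- 7.4
bwt_kg <- 2.0
pct <- (hwt_g / 1000) / bwt_kg * 100
pct  # 0.37, matching row two of the transformed cats
```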

3.

As requested we may subset cats with the subset function, remembering that any gentleman always defines femaleData first;

femaleData <- subset(cats, cats$Sex == "F")
maleData <- subset(cats, cats$Sex == "M")
head(femaleData)
##   Sex Bwt Hwt       pct
## 1   F 2.0 7.0 0.3500000
## 2   F 2.0 7.4 0.3700000
## 3   F 2.0 9.5 0.4750000
## 4   F 2.1 7.2 0.3428571
## 5   F 2.1 7.3 0.3476190
## 6   F 2.1 7.6 0.3619048
head(maleData)
##    Sex Bwt  Hwt       pct
## 48   M 2.0  6.5 0.3250000
## 49   M 2.0  6.5 0.3250000
## 50   M 2.1 10.1 0.4809524
## 51   M 2.2  7.2 0.3272727
## 52   M 2.2  7.6 0.3454545
## 53   M 2.2  7.9 0.3590909
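An alternative worth knowing is split(), which produces both subsets in one call; a sketch on a hypothetical toy data frame (so the snippet does not depend on the cats object):

```r
# split() divides a data frame into a list of sub-frames, one per factor level.
toy <- data.frame(Sex = c("F", "F", "M"), Bwt = c(2.0, 2.1, 2.9))
bySex <- split(toy, toy$Sex)
bySex$F  # rows with Sex == "F"
bySex$M  # rows with Sex == "M"
```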

HS 2

1.

We will be testing a linear regression model on the data prepared in HS1. Starting off with a (Bwt,Hwt) scatterplot, note once again that Hwt is in units of grams, with Bwt being in kilograms.

plot(cats$Bwt,cats$Hwt, main = "(Bwt, Hwt) scatterplot", xlab = "Cat bodyweight in kg", ylab = "Cat heartweight in g")

The statistical model behind the linear regression may be recited from Definition 6.1 of “Introduktion til Statistik.”

We assume that \(Y_1,\ldots,Y_n\) are independent with \(Y_i\sim\mathcal{N}(\alpha+\beta x_i,\sigma^2),\) where \(\alpha,\beta\in\mathbb{R},\,\sigma^2>0\) are unknown parameters.

Finally, we use the lm function in R to fit the linear regression model to maleData, assuming heart weight to be a linear function of body weight, such that we may write Hwt \(= \alpha + \beta\,\cdot\)Bwt.

linreg <- lm(Hwt ~ Bwt, data = maleData)

2.

Further following the procedures of Chapter 6 (in particular page 133) in “Introduktion til Statistik” we may complete a visual model validation of our linear regression model by calculating the fitted values of the model along with the standardized residuals of the data. Respectively, this is done with the assignments

fit <- fitted(linreg)
rst <- rstandard(linreg)

such that we might plot the fitted values against the standardized residuals, adding a horizontal line through zero also

plot(fit, rst, main = "(Estimate, Std. Res.)-plot for linreg model on maleData", xlab = "Estimated (fitted) Heartweight (g)", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) ) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
abline(0,0)

Looking at the above plot, we note that if the assumption that the variance of the heartweight data is independent of the mean holds, we would expect the dispersion of points in any region of fitted heartweight values to be akin to the dispersion in any other region. If the linearity assumption of our model holds, we would expect the points to be roughly equally dispersed on both sides of \(0\) across the fitted values - i.e. that the standardized residuals are themselves approximately \(\mathcal{N}(0,1)\) distributed.

The plot looks rather respectable on both these fronts, as 1. the residual values fall within the interval \(\left({-2,2}\right),\) as would be expected of a standard normal distribution, and 2. no “trumpet,” “quadratic” or other odd distortions of the standardized residual values are present; they remain rather constantly dispersed around zero within any fitted heartweight region.

Note that the values of

mean(rst)
## [1] 0.0004535485
var(rst)
## [1] 1.015539

fit in line with this, lying close to \(0\) and \(1\) respectively.

Finally, the assumption that the heartweight data is normally distributed may be visually assessed through a QQ-plot of the standardized residuals. If the normality assumption of the model holds, we would expect the standardized residuals to be \(\mathcal{N}(0,1)\) distributed, which would reveal itself if the theoretical \(\mathcal{N}(0,1)\) quantiles matched the empirical quantiles of the standardized residuals, such that the points of the QQ-plot hug the line through \(0\) with slope equal to the standard deviation of an \(\mathcal{N}(0,1)\) distributed value, i.e. slope \(\sqrt{1}=1\). The plot and the corresponding line through \(0\) with slope \(1\) can be made with the commands below

qqnorm(rst)
abline(0,1)

revealing that no sufficiently large abnormalities seem to be present that would disprove our theory that the standardized residuals are \(\mathcal{N}(0,1)\) distributed, and thus, that the data follows a normal distribution also. Do however observe that making a QQ-plot of the data itself won’t lead to much fruition, as the means are assumed to vary according to the bodyweight (unless of course we have \(\beta=0\)).
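For what it is worth, the picture qqnorm draws can be reproduced by hand, which makes the “theoretical vs. empirical quantiles” idea concrete; a sketch on simulated stand-in residuals (the seed and data are illustrative, not from the analysis above):

```r
set.seed(1)               # illustrative seed
z <- rnorm(97)            # stand-in for 97 standardized residuals
p <- ppoints(length(z))   # plotting positions in (0,1)
theo <- qnorm(p)          # theoretical N(0,1) quantiles
emp <- sort(z)            # empirical quantiles
plot(theo, emp)           # essentially the qqnorm(z) picture
abline(0, 1)
```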

All in all, we say that it seems fairly reasonable to assume that the data is distributed according to the above written model.

3.

A valuable way to get insight into the linear regression model made by lm is to deploy the summary function.

summary(linreg)
## 
## Call:
## lm(formula = Hwt ~ Bwt, data = maleData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7728 -1.0478 -0.2976  0.9835  4.8646 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.1841     0.9983  -1.186    0.239    
## Bwt           4.3127     0.3399  12.688   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.557 on 95 degrees of freedom
## Multiple R-squared:  0.6289, Adjusted R-squared:  0.625 
## F-statistic:   161 on 1 and 95 DF,  p-value: < 2.2e-16

which reveals rather a lot of detail about the fitted linear model. Information on how to interpret the output of summary may be found through the ?lm command.

Another valuable trick is that it is possible to draw out the slope (\(\beta\)) and intercept (\(\alpha\)) estimates of the linear regression of the data, and use these to define local variables beta and alpha

alpha <- linreg$coefficients[[1]]
## [1] -1.184088
beta <- linreg$coefficients[[2]]
## [1] 4.312679

Note the use of [[ ]] to extract the bare numerical value instead of the named value that results from using

linreg$coefficients[1]
## (Intercept) 
##   -1.184088
linreg$coefficients[2]
##      Bwt 
## 4.312679

Alternatively the format,

as.numeric(linreg$coefficients[1])
## [1] -1.184088
as.numeric(linreg$coefficients[2])
## [1] 4.312679

evidently works as well.
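A third, arguably tidier, accessor is coef(); a sketch on a hypothetical toy fit, so the snippet stands alone:

```r
# coef() returns the named coefficient vector of a fitted lm object.
toy <- data.frame(x = 1:10, y = 2 + 3 * (1:10))  # exact line y = 2 + 3x
fit <- lm(y ~ x, data = toy)
coef(fit)[["(Intercept)"]]  # 2
coef(fit)[["x"]]            # 3
```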

We have thus established the model maleData$Hwt = -1.1840879 + 4.3126787 · maleData$Bwt. As the rate of change of an affine function is constant, we may simply note that an increase in body weight of \(0.5\,kg\) will by the model provide a 0.5 * beta = 2.1563394 gram increase in heart weight.

Note that the model is rather imperfect, as it for example for maleData$Bwt \(\searrow 0\,kg\) estimates maleData$Hwt \(\searrow -1.1840879\,g\). And while this flaw of the model is of course in itself preposterous, we may note that if we wanted to force R to fit the best linear model to the data assuming an intercept of zero, we might have done this through the use of linreg <- lm(Hwt ~ Bwt - 1, data = maleData), i.e. with the inclusion of “-1” being the symbolic manipulation required to tell R to model under the assumption of intercept \(0.\)

4.

Let us just remind ourselves of the standard frequentistic interpretation ascribed to confidence intervals of level \(1-\alpha^*\) for the parameter \(\beta\):

Imagine the experiment being conducted multiple times in the same way, but independent of the original and each other - in our case a new batch of cats being weighed, their sex noted, and their heartweight noted also. Using the dataset corresponding to each repetition of experiment we calculate the \(1-\alpha^*\) confidence intervals for \(\beta\) (as will be done below). We would then expect the true value of \(\beta\) to be present in approximately ratio \(1-\alpha^*\) of all the intervals calculated.

From Theorem 6.9 in “Introduktion til Statistik” we have that a \(1-\alpha^*\) confidence interval for \(\beta\) can be found as the interval bounded by \[\hat{\beta}\pm t_{n-2,1-\alpha^*/2}SE(\hat\beta)\] Note that we may find the standard error \(SE(\hat\beta)\) of \(\hat\beta\) in summary(linreg) as the “Std. Error” for Bwt, which is 0.3399 in our instance. Extraction of this value might occur via

seb <- sqrt(diag(vcov(linreg)))[[2]]
## [1] 0.3398942

Having taken a look at ?qt as well as an extra look at Theorem 6.9, we may, in order to get a \(95\%\) confidence interval, do

n <- nrow(maleData)
## [1] 97
lKI <- beta - qt(0.975, n-2)*seb
uKI <- beta + qt(0.975, n-2)*seb
KI <- c(lKI, uKI)
## [1] 3.637904 4.987454

i.e. we get the \(95\%\) confidence interval \((3.6379035,\,4.987454)\).
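The same hand computation agrees with confint on any fit; a sketch on hypothetical simulated data (names and seed are illustrative):

```r
set.seed(2)                                         # illustrative seed
toy <- data.frame(x = 1:20, y = 1 + 2 * (1:20) + rnorm(20))
fit  <- lm(y ~ x, data = toy)
beta <- coef(fit)[["x"]]
seb  <- sqrt(diag(vcov(fit)))[["x"]]
n    <- nrow(toy)
manual <- beta + c(-1, 1) * qt(0.975, n - 2) * seb  # beta_hat -/+ t-quantile * SE
manual
confint(fit)["x", ]                                 # matches manual
```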

5.

Having browsed ?confint we may recover the results of the previous subproblem with confint(linreg) as follows;

confint(linreg)
##                 2.5 %    97.5 %
## (Intercept) -3.165939 0.7977636
## Bwt          3.637904 4.9874540
#We try to extract the interval;
confint(linreg)[2,]
##    2.5 %   97.5 % 
## 3.637904 4.987454
as.numeric(confint(linreg)[2,]) #KI
## [1] 3.637904 4.987454
confint(linreg)[2,1] #lKI
## [1] 3.637904
confint(linreg)[2,2] #uKI
## [1] 4.987454

6.

For \(x_i\) being the bodyweight data and \(Y_i\) being the stochastic variable representing the heartweight, both of the \(i\)’th male cat, we will by assumption have \(E(Y_i)=\alpha + \beta x_i.\) We seek a design matrix \(A\) such that we may write \(\xi:=\left({E(Y_1), E(Y_2), \ldots, E(Y_{97})}\right)^T\) as \[ \xi=A\begin{pmatrix} \alpha\\ \beta \end{pmatrix}. \] Consequently, as \(\xi\in M_{97,1}(\mathbb{R}),\,\left({\alpha, \beta}\right)^T\in M_{2,1}(\mathbb{R}),\) we will by matrix multiplication need to have \(A\in M_{97,2}(\mathbb{R}),\) and in particular for \[\begin{align*} A\begin{pmatrix} \alpha\\ \beta \end{pmatrix} \overset{\swarrow}{\equiv} \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \\ \vdots & \vdots \\ a_{97,1} & a_{97,2} \end{pmatrix}\begin{pmatrix} \alpha\\ \beta \end{pmatrix}&\overset{\swarrow}{=} \begin{pmatrix} a_{1,1}\alpha + a_{1,2}\beta \\ a_{2,1}\alpha + a_{2,2}\beta \\ \vdots \\ a_{97,1}\alpha+a_{97,2}\beta \end{pmatrix}\overset{\searrow}{=}\begin{pmatrix} \alpha + \beta x_1 \\ \alpha + \beta x_2 \\ \vdots \\ \alpha + \beta x_{97} \\ \end{pmatrix}\overset{\searrow}{\equiv}\xi\\ &\Leftrightarrow\\ a_{i,1}\equiv 1&,\,\,a_{i,2}=x_i\\ &\Leftrightarrow\\ A&=\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{97} \end{pmatrix} \end{align*}\]
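R builds exactly this design matrix from a model formula; a sketch with a short hypothetical covariate vector:

```r
x <- c(2.0, 2.5, 3.0)   # hypothetical bodyweights
A <- model.matrix(~ x)  # a column of ones plus the covariate column
A                       # rows of the form (1, x_i), as derived above
```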

HS 3

1.

We may recite the (normal) statistical model of two independent samples as given in D. 5.1 of “Introduktion til Statistik”;

Let \(X_1,\ldots,X_{n_1}\) and \(Y_1,\ldots,Y_{n_2}\) be independent normally distributed random variables with \(X_i\sim\mathcal{N}(\mu_1,\sigma^2)\) and \(Y_j\sim\mathcal{N}(\mu_2,\sigma^2),\) with \(\mu_1,\mu_2\in\mathbb{R}\) and \(\sigma^2>0\) being unknown parameters

2.

We will calculate estimates for the mean and variance of every imaginable variable below

mean(femaleData$Bwt) #2.359574
mean(femaleData$Hwt) #9.202128
mean(femaleData$pct) #0.3915119

#--

mean(maleData$Bwt) #2.9
mean(maleData$Hwt) #11.32268
mean(maleData$pct) #0.3894547

#-

var(femaleData$Bwt) #0.07506938
var(femaleData$Hwt) #1.843256
var(femaleData$pct) #0.002630804

#--

var(maleData$Bwt) #0.2185417
var(maleData$Hwt) #6.46323
var(maleData$pct) #0.002858461

Note, though, that for our particular analysis we happen to be interested in femaleData$pct and maleData$pct, each being one sample drawn from the full cats data, thus giving us two samples.

3.

Note that our model has three primary assumptions;

  1. The observations in cats$pct are independent

  2. The variance of femaleData$pct equals the variance of maleData$pct (variance homogeneity)

  3. The observations are normally distributed.

The independence assumption will usually follow from the method by which the experiment is conducted and the data is collected.

Observe from the above calculations in particular that

mean(femaleData$pct) #0.3915119
mean(maleData$pct) #0.3894547

var(femaleData$pct) #0.002630804
var(maleData$pct) #0.002858461

the latter pair fitting rather well with our assumption of variance homogeneity between femaleData$pct and maleData$pct - more precisely we may say that, on the basis of the data, we do not yet see any reason to doubt variance homogeneity as an assumption.
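If one wants a formal supplement to this eyeball comparison, base R offers var.test, an F-test of equal variances for two normal samples; a sketch on simulated groups of the same sizes as ours (hypothetical data, not the cats):

```r
set.seed(3)                 # illustrative seed
g1 <- rnorm(47, sd = 0.05)  # stand-in for the female sample
g2 <- rnorm(97, sd = 0.05)  # stand-in for the male sample
var.test(g1, g2)$p.value    # under variance homogeneity a small p-value is unexpected
```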

Note that we might define the following function, which draws a histogram and overlays a normal density, and then use it to make a histogram with a density overlaid for each of femaleData$pct and maleData$pct

NHistDensityDraw<-function(varname, xlab = deparse(substitute(varname)), main = paste("Histogram of", deparse(substitute(varname))), ...) {
  
  #Draws normaldistribution density on top of a histogram.
  
  seqT<-seq(min(varname),max(varname), by = 1/(5*10^4*(max(varname)-min(varname)))) #Counts from minimum of data to maximum of data via the by mechanism
  
  f1T<-dnorm(seqT,mean=mean(varname),sd=sd(varname)) #creating density
  maxf1T <- max(f1T)
  hist(varname, prob=1, ylim=c(0,maxf1T), xlab = xlab, main = main, ...) #Drawing histogram, making sure we get the entire height of the density estimate
  
  lines(seqT,f1T) #Drawing density
}

NHistDensityDraw(femaleData$pct)
NHistDensityDraw(maleData$pct)

which also doesn’t provide a cause for concern for our model assumptions. Getting to the meat of the model validation, we may introduce QQ-plots for the two datasets

qqnorm(femaleData$pct, main = "Female QQ Plot")
abline(mean(femaleData$pct),sd(femaleData$pct))

qqnorm(maleData$pct, main = "Male QQ Plot")
abline(mean(maleData$pct),sd(maleData$pct))

Heed that in both plots the data seems to hug the drawn line rather well, thus providing no immediate refutation of femaleData$pct or maleData$pct being normally distributed.

4.

Looking through ?t.test we may note that a \(95\%\) confidence interval for the expected difference between femaleData$pct and maleData$pct may be obtained through

t.test(femaleData$pct, maleData$pct, paired = F, var.equal = T)
## 
##  Two Sample t-test
## 
## data:  femaleData$pct and maleData$pct
## t = 0.21935, df = 142, p-value = 0.8267
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01648249  0.02059684
## sample estimates:
## mean of x mean of y 
## 0.3915119 0.3894547

noting that we have specified that we are not dealing with a paired two-sample problem - we assume independence between the two groups of female and male data respectively. One place we do find a paired analysis is in drug trials with the groupings “before taking drug” and “after taking drug”: as we sample twice from each person, a rather stark dependence between the samples occurs, as people are rather non-independent of themselves.

We also include var.equal = T, thus telling R that we assume variance homogeneity between the two samples.

Note that we may extract the confidence interval via

KI <- as.numeric(t.test(femaleData$pct, maleData$pct, paired = F, var.equal = T)[[4]])
KI
## [1] -0.01648249  0.02059684
lKI <- KI[1]
uKI <- KI[2]

Note that as \(0\) lies firmly inside the confidence interval, there is not immediately any evidence of a percentage-wise difference in heartweight relative to bodyweight between male and female cats.
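The interval t.test reports can also be reproduced by hand from the pooled variance formula, which makes the var.equal = T assumption explicit; a sketch on hypothetical groups:

```r
set.seed(4)                             # illustrative seed
x <- rnorm(47, mean = 0.39, sd = 0.05)  # stand-in for the female sample
y <- rnorm(97, mean = 0.39, sd = 0.05)  # stand-in for the male sample
n1 <- length(x); n2 <- length(y)
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)  # pooled variance
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))
ci  <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, n1 + n2 - 2) * se
ci
t.test(x, y, var.equal = TRUE)$conf.int  # matches ci
```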

5.

Having looked at ?confint we may use the requested commands;

fit <- lm(pct ~ Sex, data = cats)
summary(fit)
## 
## Call:
## lm(formula = pct ~ Sex, data = cats)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.129012 -0.038876 -0.003071  0.034442  0.136186 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.391512   0.007697  50.863   <2e-16 ***
## SexM        -0.002057   0.009379  -0.219    0.827    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05277 on 142 degrees of freedom
## Multiple R-squared:  0.0003387,  Adjusted R-squared:  -0.006701 
## F-statistic: 0.04811 on 1 and 142 DF,  p-value: 0.8267
confint(fit)
##                   2.5 %     97.5 %
## (Intercept)  0.37629566 0.40672807
## SexM        -0.02059684 0.01648249

We may then for example extract the residual standard error through

rstderr <- sqrt(deviance(fit)/df.residual(fit))
## [1] 0.05277038
rstderr^2
## [1] 0.002784713

6.

For \(Z_1,\ldots,Z_{144}\) being the stochastic variables representing pct for the 144 cats, we have assumed \(E(Z_i)=\mu_1,\,E(Z_j)=\mu_2\) for \(i=1,\ldots,97,\,j=98,\ldots,144,\) respectively representing the male and female cats. We may proceed as in HS2.6 as we endeavour to write a design matrix \(C\) such that for \(\eta:=\left({E(Z_1),\ldots,E(Z_{144})}\right)^T\equiv\left({\mu_1,\ldots,\mu_1,\mu_2,\ldots,\mu_2}\right)^T\)

\[ \eta=C\begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}. \] Consequently, as \(\eta\in M_{144,1}(\mathbb{R}),\,\left({\mu_1, \mu_2}\right)^T\in M_{2,1}(\mathbb{R}),\) we will by matrix multiplication need to have \(C\in M_{144,2}(\mathbb{R}),\) and in particular for \[\begin{align*} C\begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix} \overset{\swarrow}{\equiv} \begin{pmatrix} c_{1,1} & c_{1,2} \\ c_{2,1} & c_{2,2} \\ \vdots & \vdots \\ c_{144,1} & c_{144,2} \end{pmatrix}\begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}&\overset{\swarrow}{=} \begin{pmatrix} c_{1,1}\mu_1 + c_{1,2}\mu_2 \\ c_{2,1}\mu_1 + c_{2,2}\mu_2 \\ \vdots \\ c_{144,1}\mu_1 + c_{144,2}\mu_2 \end{pmatrix}\overset{\searrow}{=}\begin{pmatrix} \mu_1\\ \vdots\\ \mu_1\\ \mu_2\\ \vdots\\ \mu_2 \end{pmatrix}\overset{\searrow}{\equiv}\eta\\ &\Leftrightarrow\\ c_{i,1}\equiv 1_{\left\{{1,\ldots,97}\right\}}(i)&,\,\,c_{i,2}=1_{\left\{{98,\ldots,144}\right\}}(i)\\ &\Leftrightarrow\\ C&=\begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix} \end{align*}\]
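This indicator matrix is precisely what model.matrix produces for a factor without an intercept; a sketch with a short hypothetical grouping (note that R orders the columns by factor level, alphabetically):

```r
g <- factor(c("M", "M", "F"))  # hypothetical sexes
C <- model.matrix(~ g - 1)     # one 0/1 indicator column per level
C                              # each row has exactly one 1
```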

HS 4

1.

2.

3.

HS 5

1.

2.

3.

4.

5.

HS 6

1.

Note that for \(X_1,\ldots,X_{10} \overset{\text{iid}}{\sim}\mathcal{N}(0,1),\) we may, for example by NRH problem 1.1, convolution results, or C9.46 in EH (MatStat Bind 2), determine the desired distribution. Defining in accordance with C9.46 \(C:=\frac{1}{n}\left({1,\ldots,1}\right)\in\mathbb{R}^{1\times10},\,X:=\left({X_1,\ldots,X_{10}}\right)^T,\) we get that \(\overline{X}\equiv\frac{1}{n}\sum_{i=1}^{n}{X_i}\equiv CX \sim \mathcal{N}_1(C\cdot0,CIC^T)=\mathcal{N}_1(0,\frac{1}{n^2}n)=\mathcal{N}_1(0,\frac{1}{n}),\) which is also in accordance with the result reached in NRH 1.1.

2.

We may test this result through gent simulations of \(10\) outcomes of \(\mathcal{N}(0,1),\) which we then average over;

n<-10
gent <- 5*10^3
gns <- rep(NA,gent)

for (i in 1:gent) {
  X<-rnorm(n,0,1)
  gns[i] = 1/n*sum(X)
}

hist(gns, prob = T, ylim = c(0.0,1.3))
f<- function(x) dnorm(x,mean=0,sd=1/sqrt(n)) 
plot(f,-1,1,add=T)

Note also that we might create a QQ-plot for the data;

qqnorm(gns)
abline(0,1/sqrt(n))

for which we see that by Introduktion til Statistik p. 68, everything is as it should be.

3.

For each outcome of \(X_1,\ldots,X_{10}\) as \(x_1,\ldots,x_{10},\) denote by \(x_{(1)}\leq x_{(2)}\leq\ldots\leq x_{(10)}\) the ordering of the outcomes. As such the median \(\widehat{X}\) of \(X_1,\ldots,X_{10}\) will equal \(\frac{x_{(5)}+x_{(6)}}{2},\) and can be calculated with the median command.

So, let’s reimplement the above method;

med<-rep(NA,gent)
for (i in 1:gent) {
   X <- rnorm(n, 0, 1)
   med[i] <- median(X)
}
hist(med, prob=TRUE, ylim = c(0.0,1.25)) # Normalized histogram
f <- function(x) dnorm(x, mean=0, sd=1/sqrt(n))
plot(f,-1,1,add=TRUE)

#The drawn density doesn't quite fit, as the designated density appears more compressed (higher mode) than the histogram
hist(med, prob=TRUE, ylim = c(0.0,1.25))
f_2<-function(x) dnorm(x, mean= 0, sd = sd(med))
#With the empirical standard deviation it seems to go a little better.
plot(f_2,-1.2,1.2,add = T)

qqnorm(med)
#qqline(med)
mean(med)
## [1] 0.01144251
sd(med)
## [1] 0.3652325
abline(0,sd(med))

With mean(med) = 0.0114425 and sd(med) = 0.3652325, it seems reasonable by the QQ-plot to assume that the median is approximately \(\mathcal{N}(0,0.1333948)\) distributed.

Comparing the estimates for the variance of \(\overline{X}\) and \(\widehat{X}\) we see

var(gns)
## [1] 0.1001461
var(med)
## [1] 0.1333948

with the variance of the mean generally being a good bit lower than the variance of the median.
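This gap matches theory: the mean of \(n\) iid \(\mathcal{N}(0,1)\) variables has variance exactly \(1/n\), while the sample median is asymptotically \(\mathcal{N}(0,\pi/(2n))\) distributed; a sketch of the two theoretical values for \(n=10\) (the asymptotic figure is only approximate at so small an \(n\)):

```r
n <- 10
1 / n         # variance of the mean: 0.1
pi / (2 * n)  # asymptotic variance of the median: about 0.157
```

The simulated variances above sit near these values, with the mean clearly the less variable of the two.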

We may do a scatterplot of the mean against the median as follows;

plot(x = gns, y = med, xlim = c(-1.1,1.1), ylim = c(-1.1,1.1))

We may also use ?cor to understand that

cor(gns, med)
## [1] 0.02536134

is to be interpreted as …

HS 7

1.

For \(X\) being a two-dimensional stochastic variable with \[EX=\begin{pmatrix}{1\\0}\end{pmatrix},\,VX= \begin{pmatrix} 4&4\\ 4&16 \end{pmatrix} \] we note that as \(\begin{pmatrix}{1\\0}\end{pmatrix}=EX \equiv E\begin{pmatrix}{X_1\\X_2}\end{pmatrix} \equiv \begin{pmatrix}{EX_1\\EX_2}\end{pmatrix}\) we get \(EX_1=1,\,EX_2=0.\) In the same way \[ \begin{pmatrix} 4&4\\ 4&16 \end{pmatrix}=VX\equiv\begin{pmatrix} VX_1&cov(X_1,X_2)\\ cov(X_2,X_1)&VX_2 \end{pmatrix}\equiv \begin{pmatrix} VX_1&cov(X_1,X_2)\\ cov(X_1,X_2)&VX_2 \end{pmatrix} \] grants us \(VX_1=4,\,VX_2=16,\,Cov(X_1,X_2)\equiv Cov(X_2,X_1) = 4,\) and thus \[ Corr(X_1,X_2):=\frac{Cov(X_1,X_2)}{\sqrt{VX_1\,VX_2}}\overset{\cdot}{=}\frac{4}{\sqrt{64}}=\frac{4}{8}=\frac{1}{2}. \]
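R performs the same computation via cov2cor, which rescales a covariance matrix to a correlation matrix:

```r
V <- matrix(c(4, 4, 4, 16), nrow = 2)  # the given VX
cov2cor(V)                             # off-diagonal entries are Corr(X1, X2) = 0.5
```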

2.

For \(Y\equiv\begin{pmatrix}{Y_1\\Y_2}\end{pmatrix}:=\begin{pmatrix}{X_1+2\\X_2-X_1+1}\end{pmatrix},\) we get \[ EY\equiv\begin{pmatrix}{EY_1\\EY_2}\end{pmatrix}\equiv\begin{pmatrix}{E(X_1+2)\\E(X_2-X_1+1)}\end{pmatrix}=\begin{pmatrix}{EX_1+2\\EX_2-EX_1+1}\end{pmatrix}\overset{\cdot}{=}\begin{pmatrix}{1+2\\0-1+1}\end{pmatrix}\equiv\begin{pmatrix}{3\\0}\end{pmatrix}. \]

In calculating the matrix \(VY\) we may use one of two methods;

Method 1: The Direct method

We also have \[ VY_1\equiv V\mathopen{}\left({X_1+2}\right)\mathclose{}\overset{\text{MI L16.17}}{=}VX_1=4, \] and \[ VY_2\equiv V\left({X_2-X_1+1}\right)\overset{\text{MI L16.17}}{=}V\left({X_2-X_1}\right). \]

MI L19.13 expansion

Let \(Z,Q\) be random variables defined on a common background space \(\left({\Omega,\mathbb{F}, P}\right).\) If both \(Z,Q\) have second moments, then \[ V\left({Z\pm Q}\right) = VZ + VQ \pm 2Cov(Z,Q) \]

As both \(X_1,X_2\) have second moment, such that \(\left({X_1,X_2}\right)\) by MI E19.8 has covariance, we may thus by the expansion of MI L19.13 above conclude \[ VY_2 = V\left({X_2 - X_1}\right) \overset{\text{L19.13}}{=}VX_2+VX_1-2Cov(X_2,X_1)\underset{\text{MI L19.12}}{\overset{\cdot}{=}} 4+16-2\cdot4=12. \]

MI L19.12 expansion

Let \(Z,Q,W\) be random variables defined on a common background space \(\left({\Omega,\mathbb{F}, P}\right).\) If both \(\left({Z,Q}\right)\) and \(\left({Z,W}\right)\) have covariances, then \[ Cov(Z,Q\pm W) = Cov(Q\pm W, Z) = Cov(Q,Z)\pm Cov(W,Z) = Cov(Z,Q)\pm Cov(Z,W) \]

Having shown above that \(Y_1\) and \(Y_2\) both have finite variance and thus finite second moments, we may by the MI L19.12 expansion above conclude \[\begin{align} Cov(Y_1,Y_2)&\equiv Cov(X_1+2,X_2-X_1+1)\\ \overset{\text{MI (19.8)}}{=}&Cov(X_1,X_2-X_1)\\ \overset{\text{MI L19.12}}{=}&Cov(X_1,X_2)-Cov(X_1,X_1)\\ \equiv& Cov(X_1,X_2)-VX_1\overset{\cdot}{=}4-4=0 \end{align}\]

As such we altogether get \[ VY\equiv\begin{pmatrix}4&0\\0&12\end{pmatrix}. \]

Method 2: The Matrix method

For \(A:=\begin{pmatrix}1&0\\-1&1\end{pmatrix},\,b:=\begin{pmatrix}{2\\1}\end{pmatrix}\) note that \(Y=AX+b.\) As such by \[\begin{align} VY &\equiv V\left({AX+b}\right)\\ &= V\left({AX}\right)\\ &= AVXA^T\\ &\equiv\begin{pmatrix}1&0\\-1&1\end{pmatrix}\begin{pmatrix}4&4\\4&16\end{pmatrix}\begin{pmatrix}1&-1\\0&1\end{pmatrix}\\ &=\begin{pmatrix}1&0\\-1&1\end{pmatrix}\begin{pmatrix}4&0\\4&12\end{pmatrix}\\ &=\begin{pmatrix}4&0\\0&12\end{pmatrix} \end{align}\]
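The matrix method is easily checked numerically:

```r
A <- matrix(c(1, 0, -1, 1), nrow = 2, byrow = TRUE)  # the transformation matrix
V <- matrix(c(4, 4, 4, 16), nrow = 2)                # VX
A %*% V %*% t(A)                                     # diag(4, 12), matching the direct method
```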

3.

No, we cannot conclude that \(Y_1,Y_2\) are independent just because they have covariance \(0.\) In general we only have \[ Y_1,Y_2\text{ independent} \Rightarrow Cov(Y_1,Y_2)=0 \] and only in the case in which \(\left({Y_1,Y_2}\right)\) is jointly normally distributed do we have the converse as well, i.e. \[ \left({Y_1,Y_2}\right)\text{ jointly normal}:\quad Y_1,Y_2\text{ independent} \Leftrightarrow Cov(Y_1,Y_2)=0. \]

4.

For \(A:=\begin{pmatrix}1&0\\-1&1\end{pmatrix},\,b:=\begin{pmatrix}{2\\1}\end{pmatrix}\) note that \(Y=AX+b,\) as presented in the matrix method of subproblem 2. By EH K9.46, as \(X\) is now assumed regularly normally distributed, we will thus have \(Y\sim\mathcal{N}\left({{A\xi+b},{AVXA^T}}\right)=\mathcal{N}\left({{EY},{VY}}\right)\overset{\cdot}{=}\mathcal{N}\left({{\begin{pmatrix}{3\\0}\end{pmatrix}},{\begin{pmatrix}4&0\\0&12\end{pmatrix}}}\right).\)

HS 8

1.

Note that we might define new vectors \[ \tilde{p}_0(x):=p_0(x)\equiv1,\quad \tilde{p}_1(x):=p_1(x)\equiv x,\quad \tilde{p}_2(x):=\frac{p_2(x)+p_0(x)}{3}\overset{\cdot}{=}\frac{3x^2-1+1}{3}=x^2, \] which most definitely span \(\mathscr{P}_2\equiv\left\{{f:\,\left({-1,1}\right)\rightarrow\mathbb{R}\,|\,\exists a_0,a_1,a_2\in\mathbb{R},\,f(x)=a_2x^2+a_1x+a_0}\right\}\), and as we have transferred from \(\left({p_0,p_1,p_2}\right)\) to \(\left({\tilde{p}_0,\tilde{p}_1,\tilde{p}_2}\right)\) using only invertible linear combinations, \(\left({p_0,p_1,p_2}\right)\) must span \(\mathscr{P}_2\) too; being three spanning vectors in a three-dimensional space, they form a basis for \(\mathscr{P}_2\)

2.

p0 <- function(x) 1
p1 <- function(x) x
p2 <- function(x) 3*x^2 - 1
p12 <- function(x) p1(x)*p2(x)
integrate(p1,-1,1)
## 0 with absolute error < 1.1e-14
integrate(p2,-1,1)
## -2.775558e-17 with absolute error < 1.7e-14
integrate(p12,-1,1)
## 0 with absolute error < 9.2e-15
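The squared norms quoted in the next subproblem can be checked the same way; a sketch with vectorized versions of the polynomials (integrate requires vectorized integrands, so the constant is written with rep):

```r
p0v <- function(x) rep(1, length(x))  # vectorized constant for integrate
p1v <- function(x) x
p2v <- function(x) 3 * x^2 - 1
integrate(function(x) p0v(x)^2, -1, 1)$value  # 2
integrate(function(x) p1v(x)^2, -1, 1)$value  # 2/3
integrate(function(x) p2v(x)^2, -1, 1)$value  # 8/5
```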

3.

Note as described in the problem that as \(\left\lVert p_0\right\rVert^2=2,\,\left\lVert p_1\right\rVert^2=\frac{2}{3},\,\left\lVert p_2\right\rVert^2=\frac{8}{5},\) we may define an orthonormal basis for \(\mathscr{P}_2\) as \begin{equation} e_0(x):=\frac{1}{\sqrt{2}},\quad e_1(x):=\sqrt{\frac{3}{2}}x,\quad e_2(x):=\sqrt{\frac{5}{8}}(3x^2-1), \end{equation} in R with

e0 <- function(x) {1/(sqrt(2))}
e1 <- function(x) {sqrt(3/2)*x}
e2 <- function(x) { sqrt(5/8)*(3*x^2-1)}

Note now, that by C9.28 in EH, we may for \(Y_0,Y_1,Y_2\overset{\text{iid}}{\sim}\mathcal{N}(0,1)\) write any regularly normally distributed random variable \(X\) on \(\mathscr{P}_2\) with precision \(\mathopen{}\left\langle \cdot, \cdot \right\rangle\mathclose{},\) and center \(\xi\) as the sum \[X:=\sum_{i=0}^{2}{Y_ie_i}+\xi.\] Implementing this in R we get

xi <- 0 #Center: the zero function. (A length-3 vector here would be recycled against x, triggering a warning.)
Y_0 <- rnorm(1)
Y_1 <- rnorm(1)
Y_2 <- rnorm(1)

X <- function(x) {Y_0*e0(x)+Y_1*e1(x)+Y_2*e2(x) + xi}

We may also plot the simulated function;

plot(X,-1,1)

4.

We continue from subproblem 3, but will first shorten the construction; note that we may construct a function that simulates our desired random polynomial as follows;

f <- function(x) {
   temp <- rnorm(3)
   X <- temp[1]*e0(x)+temp[2]*e1(x)+temp[3]*e2(x)
   return(X)
}

As such, we may do the desired simulation as follows

plot(0,0, type="n", xlim=c(-1,1), ylim=c(-3,3), xlab="x", ylab="f(x)")
for (i in 1:25) {
plot(f,-1,1, add=TRUE) }

5.

For \(X\) regularly normally distributed on \(\mathscr{P}_2\) with center \(0\) and precision \(\left\langle \cdot, \cdot \right\rangle\) Defining \(Z:=t(X),\) as

HS 9

1.

2.

3.

4.

5.

6.

7.

HS 10

1.

2.

HS 11

1.

Having already completed EH 10.5 in RMD we may copy the assignment of \(X\) and \(A\) with

A <- matrix(c(1,0,1,-1,1,-1,1,1,1,1), nrow = 5, ncol = 2, byrow = T) #byrow = FALSE (column-wise) is the default
X <- c(-0.187,-1.731,-0.184,2.252,1.775)

2.

We note that we may reproduce the calculated estimates for \(\alpha_1,\alpha_2,\sigma\) through lm with

model <- lm(X~A-1)
model.matrix(model)
##   A1 A2
## 1  1  0
## 2  1 -1
## 3  1 -1
## 4  1  1
## 5  1  1
## attr(,"assign")
## [1] 1 1
summary(model)
## 
## Call:
## lm(formula = X ~ A - 1)
## 
## Residuals:
##       1       2       3       4       5 
## -0.5720 -0.6305  0.9165  0.3815 -0.0955 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)  
## A1   0.3850     0.3386   1.137   0.3381  
## A2   1.4855     0.3785   3.924   0.0294 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.757 on 3 degrees of freedom
## Multiple R-squared:  0.8477, Adjusted R-squared:  0.7461 
## F-statistic: 8.347 on 2 and 3 DF,  p-value: 0.05945

Looking through ?model.matrix enlightens us a bit as to the output of model.matrix(model). Note that we find the estimate \(\hat\alpha_1\) for \(\alpha_1\) under A1, Estimate, the estimate \(\hat\alpha_2\) for \(\alpha_2\) under A2, Estimate, and the estimate \(\tilde\sigma\) for \(\sigma\) under Residual standard error. We may extract these three estimates with the below commands, noting that

model$coefficients[[1]] #alphahat_1
## [1] 0.385
model$coefficients[[2]] #alphahat_2
## [1] 1.4855
summary(model)$sigma #sigmatilde
## [1] 0.7570445
sigmatilde2 <- (summary(model)$sigma)^2 #sigmatilde^2
## [1] 0.5731163

3.

Having checked ?vcov, note the output of the vcov and solve(t(A)%*%A) commands;

vcov(model)
##           A1        A2
## A1 0.1146233 0.0000000
## A2 0.0000000 0.1432791
solve(t(A)%*%A)
##      [,1] [,2]
## [1,]  0.2 0.00
## [2,]  0.0 0.25

On an inkling that there “might” be a connection between sigmatilde2, vcov(model) and solve(t(A)%*%A), note that

solve(t(A)%*%A)*sigmatilde2 #Fits rather well with vcov(model)
##           [,1]      [,2]
## [1,] 0.1146233 0.0000000
## [2,] 0.0000000 0.1432791

Referring back to EH 10.5 and the distribution of the MLE mean \(\hat\alpha\sim\mathcal{N}\mathopen{}\left({{\alpha},{\tilde\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}}}\right)\mathclose{},\) this isn’t entirely surprising, when noting the fact that ?vcov itself describes the functionality of vcov(model) as;

Returns the variance-covariance matrix of the main parameters of a fitted model object. The “main” parameters of model correspond to those returned by coef, and typically do not contain a nuisance scale parameter (sigma).

4.

Referring back to HS2, in which the extraction of standard errors for estimates from a linear regression was first carried out, gives us clues as to the connection;

sealpha_1 <- sqrt(diag(vcov(model)))[[1]]
## [1] 0.3385606
sealpha_2 <- sqrt(diag(vcov(model)))[[2]]
## [1] 0.3785222

Note then that \[ SE_{\hat\alpha_1}=\sqrt{\underset{\text{vcov(model)[1,1]}}{\underbrace{\tilde\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}_{11}}}}\\ SE_{\hat\alpha_2}=\sqrt{\underset{\text{vcov(model)[2,2]}}{\underbrace{\tilde\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}_{22}}}} \]
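The identity can be cross-checked against the table summary prints; a sketch on a hypothetical toy fit:

```r
set.seed(5)                                   # illustrative seed
toy <- data.frame(x = 1:10, y = 1:10 + rnorm(10))
fit <- lm(y ~ x, data = toy)
se_vcov <- sqrt(diag(vcov(fit)))              # SEs from the covariance matrix
se_summ <- summary(fit)$coefficients[, "Std. Error"]
all.equal(unname(se_vcov), unname(se_summ))   # TRUE
```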

HS 12

1.

2.

HS 13

Note that we are running seed 314 throughout this RMD document.

1.

We may read the data into and define the requested variables with

paddy <- read.csv("paddy.txt", sep="") 
head(paddy)
##   days yield
## 1   16  2508
## 2   18  2518
## 3   20  3304
## 4   22  3423
## 5   24  3057
## 6   26  3190
days <- paddy$days
yield <- paddy$yield
daysSqr <- days^2
kvadreg <- lm(yield ~days+daysSqr)
summary(kvadreg)
## 
## Call:
## lm(formula = yield ~ days + daysSqr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -303.96 -118.11   13.86  115.67  319.06 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1070.3977   617.2527  -1.734    0.107    
## days          293.4829    42.1776   6.958 9.94e-06 ***
## daysSqr        -4.5358     0.6744  -6.726 1.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.9 on 13 degrees of freedom
## Multiple R-squared:  0.7942, Adjusted R-squared:  0.7625 
## F-statistic: 25.08 on 2 and 13 DF,  p-value: 3.452e-05

we have made the regression model as written in EH (11.15). Note that we may collect the estimates \(\hat\alpha,\,\hat\beta\) and \(\hat\gamma\) with the code

alphah <- kvadreg$coefficients[[1]]
## [1] -1070.398
betah <- kvadreg$coefficients[[2]]
## [1] 293.4829
gammah <- kvadreg$coefficients[[3]]
## [1] -4.535802
tsigma2 <- (summary(kvadreg)$sigma)^2
## [1] 41568.34

and that these match the estimates reported in EH Example 11.10.

2.

We will be plotting the standardized residuals against the estimated (fitted) values for yield, in the same way we did (e.g.) in HS2;

fit <- fitted(kvadreg)
rst <- rstandard(kvadreg)
plot(fit, rst, main = "(Estimate, Std. Res.)-plot for kvadreg model on paddy", xlab = "Estimated (fitted) yield-values (kg/ha)", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) ) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
abline(0,0)

Looking at the above plot, note that if the assumption that the variance of the yield data is independent of the mean holds, we would expect the dispersion of points in any region of fitted yield values to be akin to the dispersion in any other region. If the “linearity” assumption of our model holds, we would expect the points to be dispersed roughly symmetrically around \(0\) across the fitted values - i.e. that the standardized residuals themselves are approximately \(\mathcal{N}(0,1)\) distributed.

The plot looks rather respectable on both these fronts, as 1. the (std.) residual values fall within the interval \(\left({-1.5,1.5}\right),\) as would be expected of a standard normal distribution, and 2. no weird “trumpet,” “quadratic” or other distortions to the std. residual values are present; they seem to stay rather constantly dispersed around zero within any fitted yield region.

3.

We may run the requested commands in R, simulating \(16\) new datapoints from normal distributions with means given by the fitted values and standard deviation given by the residual standard error of the original fit. The implementation;

xi0 <- fitted(kvadreg)
sigma0 <- summary(kvadreg)$sigma
## [1] 203.8831
simYield <- rnorm(16, xi0, sigma0)

We note that the simulated scatterplot retains a lot of the characteristics of the original, though it is definitely different to the original.

4.

As we did with the original data, we will just as well fit a quadratic regression model to the simulated data.

yield2 <- simYield
kvadreg2 <- lm(yield2 ~ days+daysSqr)
alphah2 <- kvadreg2$coefficients[[1]]
## [1] -868.701
betah2 <- kvadreg2$coefficients[[2]]
## [1] 279.8275
gammah2 <- kvadreg2$coefficients[[3]]
## [1] -4.365983
tsigma2_2 <- (summary(kvadreg2)$sigma)^2
## [1] 68692.48

We may also create a new std. residual plot for the simulated data

fit2 <- fitted(kvadreg2)
rst2 <- rstandard(kvadreg2)
plot(fit2, rst2, main = "(Estimate, Std. Res.)-plot for kvadreg2 model on paddy", xlab = "Estimated (fitted) yield-values (kg/ha)", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(rst2))), max(3.2,max(abs(rst2)))) ) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
abline(0,0)

Which once again doesn’t seem to have much of anything out of the ordinary.

5.

We may define a simulation factory function Sim. Do also note the functionality of the par function with ?par
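A minimal sketch of such a factory, assuming xi0, sigma0, days and daysSqr as defined above (the function body and the 3x3 layout are our own choices, mirroring what is done later in this document);

```r
# Hypothetical simulation factory: draws a new yield vector, refits the
# quadratic model and draws its standardized residual plot
Sim <- function() {
  simYield <- rnorm(16, xi0, sigma0)
  kvadregSim <- lm(simYield ~ days + daysSqr)
  fitS <- fitted(kvadregSim)
  rstS <- rstandard(kvadregSim)
  plot(fitS, rstS, xlab = "", ylab = "",
       ylim = c(-max(3.2, max(abs(rstS))), max(3.2, max(abs(rstS)))))
  abline(h = 0)
}
par(mfrow = c(3, 3), mar = c(0.55, 0.55, 0.55, 0.55)) # 3x3 grid of plots
for (i in 1:9) Sim()
```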

Conclusions….

6.

The estimates can be obtained with the below code.

coef(kvadreg)
##  (Intercept)         days      daysSqr 
## -1070.397689   293.482948    -4.535802
est <- as.numeric(coef(kvadreg))
## [1] -1070.397689   293.482948    -4.535802

Observe that we are dealing with the model \[ EYield_i=\gamma\cdot daysSqr_i+\beta\cdot days_i+\alpha\equiv\gamma\cdot days_i^2+\beta\cdot days_i+\alpha \] and as such we know from elementary mathematics that the stationary point of \(Yield,\,T_{Yield}\equiv\mathopen{}\left({optDay,optYield}\right)\mathclose{}\) may be found at \[ T_{Yield}\equiv\mathopen{}\left({optDay,optYield}\right)\mathclose{}=\mathopen{}\left({\frac{-\beta}{2\gamma},\frac{-\mathopen{}\left({\beta^2-4\gamma\alpha}\right)\mathclose{}}{4\gamma}}\right)\mathclose{} \] As such we may estimate the location of the stationary point (the vertex) with \[ \hat T_{Yield}\equiv\mathopen{}\left({\frac{-\hat\beta}{2\hat\gamma},\frac{-\mathopen{}\left({\hat\beta^2-4\hat\gamma\hat\alpha}\right)\mathclose{}}{4\hat\gamma}}\right)\mathclose{} \] in R with

optDay <- -est[2]/(2*est[3])
## [1] 32.35183
optYield <- - (est[2]^2 - 4*est[1]*est[3]) / (4* est[3])
## [1] 3676.957
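As a sanity check, one might also maximize the fitted parabola numerically, e.g. with optimize. A sketch, assuming est as defined above;

```r
# Numerical cross-check of the vertex: maximize the fitted quadratic
# over the observed range of days
yieldhat <- function(d) est[1] + est[2] * d + est[3] * d^2
opt <- optimize(yieldhat, interval = range(days), maximum = TRUE)
c(opt$maximum, opt$objective) # should agree with (optDay, optYield)
```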

7.

The time of harvest days cannot be written as an affine transformation of our parameters, as was the premise in Chapters 9 and 10. Specifically, it will not be possible to find a matrix \(Q\) such that \(E\,days_i=Q\begin{pmatrix}\alpha\\\beta\\\gamma\end{pmatrix},\) as isolation of days in the model requires squaring and square roots (decidedly non-affine operations) in accordance with the quadratic formula.
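Concretely, solving the mean relation \(\xi=\alpha+\beta\cdot days+\gamma\cdot days^2\) for days via the quadratic formula gives \[ days=\frac{-\beta\pm\sqrt{\beta^2-4\gamma\mathopen{}\left({\alpha-\xi}\right)\mathclose{}}}{2\gamma}, \] which involves a square root of the parameters and is therefore decidedly not affine in \(\mathopen{}\left({\alpha,\beta,\gamma}\right)\mathclose{}.\)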

8.

We may repeat the tasks of subproblem 6 with the data previously simulated in simYield

estSim <- as.numeric(coef(kvadreg2))
## [1] -868.700973  279.827547   -4.365983
optDaySim <- -estSim[2]/(2*estSim[3])
optYieldSim <- - (estSim[2]^2 - 4*estSim[1]*estSim[3]) / (4*estSim[3])
c(optDaySim,optYieldSim)
## [1]   32.04634 3615.02303

9.

We may repeat the above simulation \(1000\) times with

n <- 10^3
resSim <- matrix(NA, n, 2)   # Initiation of a matrix
for (i in 1:n) {
  simYield <- rnorm(16, xi0, sigma0)
  kvadregSim <- lm(simYield ~ paddy$days + daysSqr)
  estSim <- as.numeric(coef(kvadregSim))
  optDaySim <- -estSim[2]/2/estSim[3]
  optYieldSim <- - (estSim[2]^2 - 4*estSim[1]*estSim[3]) / 4 / estSim[3]
  resSim[i,1] <- optDaySim
  resSim[i,2] <- optYieldSim
}

10.

We may draw histograms with

par(mfrow=c(2,1))
hist(resSim[,1], main="optDay", breaks = 30)
hist(resSim[,2], main="optYield", breaks = 30)

It doesn’t seem unreasonable to assume that a normal distribution could fit the data rather well;

fDay <- function(x) dnorm(x, mean = mean(resSim[,1]) , sd = sd(resSim[,1])) # the normal-density curve
fYield <- function(x) dnorm(x,mean=mean(resSim[,2]), sd = sd(resSim[,2]))


par(mfrow=c(2,1))
hist(resSim[,1], main="optDay", prob=T); plot(fDay, 30,35, add=T, col="blue")
hist(resSim[,2], main="optYield", prob=T); plot(fYield, 3400,3900,add=T, col="hotpink")

11.

One way to get approximate \(95\%\) confidence intervals for each of optDay,optYield is to use the quantile function (see ?quantile);

quantile((resSim[,1]),c(0.025,0.975))
##     2.5%    97.5% 
## 31.27306 33.85067
quantile((resSim[,2]),c(0.025,0.975))
##     2.5%    97.5% 
## 3536.514 3827.813
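Since the histograms in subproblem 10 looked approximately normal, an alternative would be normal-approximation intervals of the form mean plus/minus \(z_{0.975}\cdot\) sd. A sketch, assuming resSim from the simulation above;

```r
# Normal-approximation 95% intervals for optDay and optYield
mean(resSim[, 1]) + c(-1, 1) * qnorm(0.975) * sd(resSim[, 1])
mean(resSim[, 2]) + c(-1, 1) * qnorm(0.975) * sd(resSim[, 2])
```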

12.

Based on 16 simulated outcomes, we will be fitting a straight-line (“linear” linear) regression, in order to point out problems with the mean specification that arise in this regard, when doing visual model control with the likes of residual plots.

simYield <- rnorm(16, xi0, sigma0)
linRegSim <- lm(simYield ~ paddy$days)
linresSim <- rstandard(linRegSim)
linfitSim <- fitted(linRegSim)
plot(linfitSim, linresSim, main = "(Estimate, Std. Res.)-plot for linear linreg model on simulated data", xlab = "Estimated (fitted) Yield", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(linresSim))), max(3.2,max(abs(linresSim)))) );abline(0,0)

Note the clear parabolic shape of the points in the std.-residual plot above. While the points, as they should, fall within \((-2,2),\) the manner in which they do so (via the aforementioned parabolic shape) suggests that an assumption of the linear relation

\[ EsimYield_i=\zeta+\varphi\cdot days_i \] is rather implausible.

We may, just as in subproblem 5 repeat the process of simulation;

SimNPlot<-function(i) {
   simYield <- rnorm(16, xi0, sigma0)
   linRegSim <- lm(simYield ~ days)
   fit <- fitted(linRegSim)
   rst <- rstandard(linRegSim)
   plot(fit, rst, xlim=c(3000,3600), main = "", xlab = "", ylab = "", ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) ) 
   #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
   abline(0,0)
}

par(mfrow = c(3,3), mar=c(0.55,0.55,0.55,0.55))
for (i in 1:9) SimNPlot(i)

Looking at the different simulated plots, notice that a fair bit of variety appears, though in each case the parabolic pattern remains clear enough that we could say the linearity assumption is implausible.

13.

Below, we will be using the suggested commands to explore the result of fudging with the variance when doing std. residual plots. Note that one may examine ?points for further details regarding the points command.

newSD <- 25*(15-abs(paddy$days-31))
newSD
##  [1]   0  50 100 150 200 250 300 350 350 300 250 200 150 100  50   0
simYield <- rnorm(16, xi0, newSD)
plot(paddy$days, simYield)
points(paddy$days, xi0, type="l")

Above, we are manipulating the standard deviation, as shown when calling newSD: we assign a symmetrically high deviation to the “middle” of the \(16\) simulated datapoints, i.e. the highest deviation to the \(8\)th and \(9\)th draws from the normal distribution, and then progressively and symmetrically lower deviations towards the “edges.”

The result of the above procedure, when plotting simYield against days, is that since days is ordered, the \(i\)’th yield simulation (the middle ones of which have now been simulated with greater variance) is paired with the corresponding \(i\)’th day.

Consequently for the “quadratic” standardized-residual plot

kvadregSim <- lm(simYield ~ paddy$days + daysSqr)
resSim <- rstandard(kvadregSim)
fitSim <- fitted(kvadregSim)
plot(fitSim, resSim, main = "(Estimate, Std. Res.)-plot", xlab = "Estimated (fitted) Yield", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(resSim))), max(3.2,max(abs(resSim)))) );abline(h=0,lty=2)

Note that the highest fitted Yield values seem to be significantly more spread out than the rest.

We may simulate more datasets to explore the reproducibility of this result

par(mfrow = c(3,3), mar=c(0.55,0.55,0.55,0.55))
for (i in 1:9){
  simYield <- rnorm(16, xi0, newSD)
  kvadregSim <- lm(simYield ~ paddy$days + daysSqr)
  resSim <- rstandard(kvadregSim)
  fitSim <- fitted(kvadregSim)
  plot(fitSim,resSim, xlim=c(2400,4000), main = "", xlab = "", ylab = "", ylim = c(-max(3.2,max(abs(resSim))), max(3.2,max(abs(resSim))))) ;abline(h=0,lty=2)
}

The simulated datasets seem rather “trumpet shaped,” in that the standardized residuals possess an increased variance towards the higher yield values. This is in accordance with the procedure by which the data was simulated: we have simulated a larger standard deviation for the middle datapoints, which, in accordance with the quadratic regression model, also tend to have larger fitted values, so we would expect to see larger variance for the larger values of the dataset. This in turn transfers onto the residuals, which, in keeping with the plots above, we expect to be more spread out towards the larger yield values.

HS 14

1.

If one were primarily to be interested in predicting fat percentages through “the use of the other variables,” one might try a linear regression model of the form;

For \(Fat_1,\ldots Fat_N\) independent normally distributed with variance homogeneity of \(\sigma^2\) and mean \(EFat_i=\alpha+\beta\cdot Triceps_i+\gamma\cdot Thigh_i+\delta\cdot Midarm_i,\,\,i=1,\ldots,N.\)

2.

We may start off with importing the relevant dataset;

bodyfat <- read_table2("bodyfat.txt") # read_table2 from the readr package
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Fat = col_double(),
##   Triceps = col_double(),
##   Thigh = col_double(),
##   Midarm = col_double()
## )
head(bodyfat)
## # A tibble: 6 x 4
##     Fat Triceps Thigh Midarm
##   <dbl>   <dbl> <dbl>  <dbl>
## 1  11.9    19.5  43.1   29.1
## 2  22.8    24.7  49.8   28.2
## 3  18.7    30.7  51.9   37  
## 4  20.1    29.8  54.3   31.1
## 5  12.9    19.1  42.2   30.9
## 6  21.7    25.6  53.9   23.7

as well as using the commands

plot(bodyfat)

cor(bodyfat)
##               Fat   Triceps     Thigh    Midarm
## Fat     1.0000000 0.8432654 0.8780896 0.1424440
## Triceps 0.8432654 1.0000000 0.9238425 0.4577772
## Thigh   0.8780896 0.9238425 1.0000000 0.0846675
## Midarm  0.1424440 0.4577772 0.0846675 1.0000000

Looking at ?cor one might conclude that cor generates a correlation matrix for the dataset. The “weird” appearance of the above plot is caused by the fact that we have a total of four variables at play in the dataset, such that there are \(4\cdot 4 = 16\) different ways of plotting one variable against another. Of the \(16,\) however, four would plot a variable against itself, which is rather silly, and thus R instead creates the resulting scatterplot matrix with the variable names on the diagonal.
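Indeed, plot applied to a data frame dispatches to pairs, so the same matrix may be produced directly with;

```r
# Equivalent scatterplot matrix; the diagonal panels hold the variable names
pairs(bodyfat)
```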

3.

We will be fitting the model from subproblem 1 using lm

m1 <- lm(Fat ~ Triceps + Thigh + Midarm, data = bodyfat)

We might also do a visual modelcontrol through the use of a standardized residual plot;

rst <- rstandard(m1)
fit <- fitted(m1)
plot(fit, rst, main = "(Estimate, Std. Res.)-plot", xlab = "Estimated (fitted) Fat Percentage", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) );abline(h=0,lty=2)

Which looks rather respectable on all fronts.

4.

We may look at the summary(m1) object;

summary(m1)
## 
## Call:
## lm(formula = Fat ~ Triceps + Thigh + Midarm, data = bodyfat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7263 -1.6111  0.3923  1.4656  4.1277 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  117.085     99.782   1.173    0.258
## Triceps        4.334      3.016   1.437    0.170
## Thigh         -2.857      2.582  -1.106    0.285
## Midarm        -2.186      1.595  -1.370    0.190
## 
## Residual standard error: 2.48 on 16 degrees of freedom
## Multiple R-squared:  0.8014, Adjusted R-squared:  0.7641 
## F-statistic: 21.52 on 3 and 16 DF,  p-value: 7.343e-06

Amongst the coefficient parameters we find the t-statistic values. These are of the form \[ t_{\hat\zeta}=\frac{\hat\zeta-\zeta_0}{se(\hat\zeta)}, \] for some corresponding parameter \(\zeta\) of our model. By default R sets \(\zeta_0=0,\) so the t values that appear result from division of the estimate of the parameter by its standard error, i.e. with \(\zeta_0=0\) we have \(t_{\hat\zeta}=\frac{\hat\zeta}{se(\hat\zeta)}.\)

With these t values, the corresponding Pr(>|t|) values are the symmetric tail probabilities (p values) of getting t values of greater magnitude than \(t_{\hat\zeta}\) in a Student’s t distribution with m1$df.residual=16 degrees of freedom. That is, they are p values: the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct, such that small values are critical to the hypothesis. We may formalize the writing of the p values; for \(T\sim t_{16},\) the Pr(>|t|) value corresponding to \(\zeta\) is \[ P(|T|>|t|)\overset{\text{symmetry}}{=}2P(T>|t|). \] As an example of a manual calculation of one of the p values, we could approach the hypothesis \[ H:\,\alpha = 0, \] which by the model corresponds to the hypothesis that the intercept is \(0.\)

(degfree <- m1$df.residual)
## [1] 16
hatalpha <- coef(summary(m1))[1,1]
## [1] 117.0847
sealpha <- coef(summary(m1))[1,2]
## [1] 99.7824
t_alpha <- coef(summary(m1))[1,3]
## [1] 1.1734
pt_alpha <- 2*pt(t_alpha, degfree, lower.tail = F)
## [1] 0.2578078
pt_alpha-coef(summary(m1))[1,4] #Is there a difference between our p-value, and summary's? <- nope.
## [1] 0

Looking at the p-value vector

pvalvec <- coef(summary(m1))[,4]
pvalvec
## (Intercept)     Triceps       Thigh      Midarm 
##   0.2578078   0.1699111   0.2848944   0.1895628

we might conclude that none of the p values are so critical (small) that we may dismiss any of the “marginal” null hypotheses - i.e. no single explanatory variable is individually significant at conventional levels, even though the overall F-test in summary(m1) was highly significant.

5.

After looking through ?anova, we may use the recommended commands

m2 <- lm(Fat ~ Midarm, data = bodyfat)
anova(m2,m1)
## Analysis of Variance Table
## 
## Model 1: Fat ~ Midarm
## Model 2: Fat ~ Triceps + Thigh + Midarm
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     18 485.34                                  
## 2     16  98.40  2    386.93 31.456 2.856e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With Pr(>F) = 2.856e-06, the F-test is highly critical of the hypothesis that m1 may be reduced to m2: we reject that Fat is adequately described by Midarm alone, in favour of the full model m1.

6.

7.

8.

9.

HS 15

We may import the dataset;

engel <- read_table2("engel.txt") #read_table2 = whitespace delimiter
## 
## -- Column specification --------------------------------------------------------
## cols(
##   income = col_double(),
##   foodexp = col_double()
## )

1.

Having registered both the income and the amount spent on food for different households, one might imagine that the amount spent on food could, within some sort of causal structure, be seen as a function of the income of the corresponding household. We may hold the hypothesis that more household income translates into more disposable income, and that a household with more disposable income would then dispense more of it towards food. We might create the corresponding scatterplot

foodexp <- engel$foodexp # note that 'attach' would only have scope of the current R chunk
income <- engel$income
p <- ggplot(data = engel)
p_1 <- geom_point(mapping = aes(x=income, y = foodexp))
p+p_1

It seems like there is a central body of observations with some sort of underlying linear trend going on, though the observations appear more spread out the higher the income within this central body. There also seem to be three outliers from the central body of observations, which are probably going to have a fair bit of sway on a linear fit.

2.

By subproblem 1 it doesn’t seem totally unreasonable to write the statistical model;
Let \(foodexp_1,\ldots,foodexp_{235}\) be independent normally distributed with variance homogeneity of \(\sigma^2\) and mean \(Efoodexp_i=\alpha+\beta\cdot income_i,\,\,i=1,\ldots,235.\)

We may fit this model in R with

model <- lm(foodexp ~ income, data = engel)

We might then do a visual model control through a standardized residual plot, and a QQ-plot;

fit <- fitted(model)
rst <- rstandard(model)
library(gridExtra)
p_2 <- qplot(fit, rst, main = "(Estimate, Std. Res.)-plot for linreg model on engel", xlab = "Estimated (fitted) foodexpence", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) )+geom_hline(yintercept = 0) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
p_3 <- p + geom_qq() + geom_qq_line() + aes(sample = rst)
grid.arrange(p_2,p_3, ncol = 2)

From the residual plot we might confirm our suspicions regarding a non-constant variance, as we see the variance seemingly growing with the mean, such that the variance homogeneity assumption seems unreasonable. As there is also a significant number of tail outliers from the central trend of the QQ-plot, we might conclude that the normality assumption regarding foodexp is also unreasonable.

3.

With the theory of EH Chapter 11.3 at our back, we may log-transform both variables;

lfood <- log(foodexp)
lincome <- log(income)
Our new statistical model will be:
Let \(lfood_1,\ldots,lfood_{235}\) be independent normally distributed with variance homogeneity of \(\sigma^2\) and mean \(Elfood_i=\alpha_l+\beta_l\cdot lincome_i,\,\,i=1,\ldots,235.\)

One might get the idea that a log transformation of the “original model” is prudent, as the log transform serves to stabilize the variability, especially where the variance increases with the mean.
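One approximate way to see this variance-stabilising effect is via a first-order (delta-method) expansion: for a random variable \(Y\) with mean \(\mu,\) \[ \mathrm{Var}\mathopen{}\left({\log Y}\right)\mathclose{}\approx\frac{\mathrm{Var}\mathopen{}\left({Y}\right)\mathclose{}}{\mu^2}, \] so if the standard deviation of \(Y\) grows roughly proportionally with its mean, \(\log Y\) has approximately constant variance.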

We will once again fit a linear regression model, noting that the “linear” in linear regression relates to how the parameters appear in the model (thus “linear” should really be interpreted as “affine”). In R;

lmodel <- lm(lfood ~ lincome) # regress log-foodexp on log-income, per the model above
lfit <- fitted(lmodel)
lrst <- rstandard(lmodel)
p_4 <- qplot(lfit, lrst, main = "(Estimate, Std. Res.)-plot for log-linreg model on engel", xlab = "Estimated (fitted) lfood", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(lrst))), max(3.2,max(abs(lrst)))) )+geom_hline(yintercept = 0) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
p_5 <- p + geom_qq() + geom_qq_line() + aes(sample = lrst)
grid.arrange(p_4,p_5, ncol = 2)

The log transformation seems to have done some good, as the residual plot now seems to hold up to both the linearity assumption for the means and the variance homogeneity assumption, with only a few outliers. The normality assumption also seems like a rather reasonable approximation, once again with only a few outliers.

4.

We may collect the major parameters of the model with

lalpha <- lmodel$coefficients[[1]]
## [1] 0.2249831
lbeta <- lmodel$coefficients[[2]]
## [1] 1.032704
ltsigma2 <- (summary(lmodel)$sigma)^2
## [1] 0.02257692

5.

For the model in subproblem 3

6.

7.

8.

9.

HS 16

We may start by importing the dataset

pillbug <- read_table2("pillbug.txt", 
    col_types = cols(group = col_factor(levels = c("Light", 
        "Moisture", "Control"))))
head(pillbug)
## # A tibble: 6 x 2
##    time group
##   <dbl> <fct>
## 1    23 Light
## 2    12 Light
## 3    29 Light
## 4    12 Light
## 5     5 Light
## 6    47 Light

1.

By EH page 499, the point of a one-way analysis of variance is to test the model under the assumption that data with the same label come from the same normal distribution, while there might be a difference in mean between the different labels.

As such, we may set up the analysis in a variety of different ways. The simplest model, in the experiment where we want to test whether exposure to light or moisture affects time, is the constant factor system \(L_1,\) under which none of the treatments has any impact at all. In terms of the one-way analysis of variance, this is the \(L_1\) assumption that all the groups are distributed the same way; since one of the treatments is ‘no treatment,’ this is the hypothesis that none of the treatments is effective, i.e. that \(\xi\equiv\mathopen{}\left({\xi_i}\right)\mathclose{}_{i\in I}\in L_1,\) such that \(X_i\sim\mathcal{N}\mathopen{}\left({{\xi},{\sigma^2}}\right)\mathclose{}\) for all \(i\in I.\)

Note especially that we are dealing with a treatment factor \(t:I\rightarrow T\) with \(T=\mathopen{}\left\{{\text{no treatment},\text{light},\text{moisture}}\right\}\mathclose{}.\) As such we would be looking at the model constructed via the following lm commands, with the design matrix thereafter.

lm_L1 <- lm(time ~ 1, data = pillbug)
model.matrix(time ~ 1, data = pillbug)
##    (Intercept)
## 1            1
## 2            1
## 3            1
## 4            1
## 5            1
## 6            1
## 7            1
## 8            1
## 9            1
## 10           1
## 11           1
## 12           1
## 13           1
## 14           1
## 15           1
## 16           1
## 17           1
## 18           1
## 19           1
## 20           1
## 21           1
## 22           1
## 23           1
## 24           1
## 25           1
## 26           1
## 27           1
## 28           1
## 29           1
## 30           1
## 31           1
## 32           1
## 33           1
## 34           1
## 35           1
## 36           1
## 37           1
## 38           1
## 39           1
## 40           1
## 41           1
## 42           1
## 43           1
## 44           1
## 45           1
## 46           1
## 47           1
## 48           1
## 49           1
## 50           1
## 51           1
## 52           1
## 53           1
## 54           1
## 55           1
## 56           1
## 57           1
## 58           1
## 59           1
## 60           1
## attr(,"assign")
## [1] 0

Alternatively, we may look at the one-way analysis of variance through the lens of the hypothesis that there really might be a difference in effect between the different treatments. When dealing with the ‘treatment’ factor group we may therefore construct the analysis in R as is done in the following code chunk. Note in particular that the first uncommented line is present in order to change the intercept reference from the first factor level to Control; one might compare the derived design matrices with and without this relevelling. See also reorder() and factor(). Note also that repeated use of relevel can perform any required permutation of the levels, as seen in the commented line.

#pillbug$group <- relevel(pillbug$group, ref = "Moisture") #relevel M before C, gives order; C, M, L
pillbug$group <- relevel(pillbug$group, ref = "Control")
lm_LT <- lm(time ~ group, data = pillbug)
model.matrix(time ~ group, data = pillbug)
##    (Intercept) groupLight groupMoisture
## 1            1          1             0
## 2            1          1             0
## 3            1          1             0
## 4            1          1             0
## 5            1          1             0
## 6            1          1             0
## 7            1          1             0
## 8            1          1             0
## 9            1          1             0
## 10           1          1             0
## 11           1          1             0
## 12           1          1             0
## 13           1          1             0
## 14           1          1             0
## 15           1          1             0
## 16           1          1             0
## 17           1          1             0
## 18           1          1             0
## 19           1          1             0
## 20           1          1             0
## 21           1          0             1
## 22           1          0             1
## 23           1          0             1
## 24           1          0             1
## 25           1          0             1
## 26           1          0             1
## 27           1          0             1
## 28           1          0             1
## 29           1          0             1
## 30           1          0             1
## 31           1          0             1
## 32           1          0             1
## 33           1          0             1
## 34           1          0             1
## 35           1          0             1
## 36           1          0             1
## 37           1          0             1
## 38           1          0             1
## 39           1          0             1
## 40           1          0             1
## 41           1          0             0
## 42           1          0             0
## 43           1          0             0
## 44           1          0             0
## 45           1          0             0
## 46           1          0             0
## 47           1          0             0
## 48           1          0             0
## 49           1          0             0
## 50           1          0             0
## 51           1          0             0
## 52           1          0             0
## 53           1          0             0
## 54           1          0             0
## 55           1          0             0
## 56           1          0             0
## 57           1          0             0
## 58           1          0             0
## 59           1          0             0
## 60           1          0             0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Alternatively, one might specify the factor levels already at the importation of the data set into R, in the following manner:

pillbug <- read_table2("pillbug.txt", 
    col_types = cols(group = col_factor(levels = c("Control", "Light", 
        "Moisture"))))

2.

As requested, we will be trying the command

boxplot(time~group, data=pillbug)

Having seen the result, we might investigate its origin and the functionality of the boxplot command with ?boxplot. The whiskers mark the minimum and maximum of the data, excluding ‘outliers.’ The main body of the boxplot spans the interquartile range (IQR) from the \(25^{th}\) to the \(75^{th}\) percentile, with the median (the \(50^{th}\) percentile) marked by the filled line, in each of the groups. We might do the similar plot in ggplot2 with

ggplot(pillbug, aes(x=group, y=time, color=group)) +
  geom_boxplot() + theme(legend.position = "none")
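
The box edges may also be recovered directly with quantile. A sketch, assuming pillbug as imported above;

```r
# The box of a boxplot spans the quartiles; here for the Light group
with(subset(pillbug, group == "Light"),
     quantile(time, c(0.25, 0.5, 0.75)))
```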

We might hereafter use our above constructed lm_LT model, and make the attached residual plot;

fitLT <- fitted(lm_LT)
rstLT <- rstandard(lm_LT)
Stdresplot <- function(model, main = paste("(Estimate, Std. Res.)-plot of", deparse(substitute(model))), ylab ="Standardized residuals", ...) {
 fit <- fitted(model)
 rst <- rstandard(model)
 qplot(fit, rst, main = main, ylab = ylab, ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) )+geom_hline(yintercept = 0) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
}
QQplotdraw <- function(model, main = paste("Normal QQ-plot of", deparse(substitute(model))), xlab = "Theoretical Quantiles", ylab ="Sample Quantiles", ...) {
   rst <- rstandard(model)
   ggplot(data = eval(getCall(model)$data)) + geom_qq() + geom_qq_line() + aes(sample = rst) +
     labs(title = main, x = xlab, y = ylab) # ggplot() ignores main/xlab/ylab arguments; labs() sets them instead
}
p1 <- Stdresplot(lm_LT)
p2 <- QQplotdraw(lm_LT)
library(gridExtra)
grid.arrange(p1,p2, ncol = 2)

We see from the standardized residual plot that the linearity assumption of the model seems very plausible, while the cluster having lower times (which turns out to be the Light cases) seems to have significantly lower variance than the other cluster with larger times. We might therefore question the variance homogeneity of the data.

Looking at the QQ-plot we see a definite offset from the line with intercept 0 and slope 1 that would be expected from matching the theoretical \(\mathcal{N}\mathopen{}\left({{0},{1}}\right)\mathclose{}\) quantiles with the sample quantiles of the standardized residuals.

3.

As requested, we may log-transform time in our linear regression model. Doing this in R with

lm_logLT <- lm(I(log(time)) ~ group, data = pillbug)
StdresQQPlot <- function(model,...) {
   p1 <- Stdresplot(model,...)
   p2 <- QQplotdraw(model,...)
   library(gridExtra)
   grid.arrange(p1,p2, ncol = 2)
}
StdresQQPlot(lm_logLT)

ggplot(pillbug, aes(x=group, y=log(time), color=group)) +
  geom_boxplot() + theme(legend.position = "none")

we see that the QQ-plot is still not entirely convincing, but that the variance homogeneity assumption now seems considerably more appropriate. We thus see that the logarithm has a variance-stabilising effect when applied to time.

4.

The model of the hypothesis that neither moisture nor light cause any change in running time is the model in which the mean vector \(\xi\equiv\mathopen{}\left({\xi_i}\right)\mathclose{}_{i\in I}\in L_1,\) i.e. that there is no difference in the distribution of the data across the labels, as one of the labels literally is ‘no treatment.’ As such, note the definition of lm_logL1 in R below, where we, in accordance with subproblem 3, log-transform time in order to better meet the model assumption of variance homogeneity. Note that the anova test performed asks whether it makes sense to divide the data amongst the groups, as is done under the \(L_T\) hypothesis of assuming that the distributions within the groups are similar, or whether we may reduce this to there being no difference amongst the groups;

lm_logL1 <- lm(I(log(time))~1, data = pillbug)
(an <- anova(lm_logL1,lm_logLT)) #Note the ordering of the models, such that the smallest model comes first.
## Analysis of Variance Table
## 
## Model 1: I(log(time)) ~ 1
## Model 2: I(log(time)) ~ group
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     59 76.677                                  
## 2     57 23.567  2     53.11 64.227 2.498e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Doing the analysis of variance between the model grouped by the individual treatments and \(L_1\) (the second row), note that we have an \(F\)-test with \((2,57)\) degrees of freedom. The \(2\) in \((2,57)\) reflects that going from the grouping model to the collapsed model removes two mean parameters, as we go from three different treatment groups (no treatment, light and moisture) to just one (\(L_1\): all the data belongs to the same group).

With an \(F\)-value of 64.227, corresponding to a p-value of \(\cong2.5\cdot10^{-15},\) we may with great ease lay to rest the hypothesis that these pillbugs have the same running time no matter the treatment, as this seems exceedingly unlikely from the data.
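The \(F\) statistic can be recomputed directly from the RSS values printed in the ANOVA table; a quick language-neutral cross-check of the arithmetic (sketched here in Python, with the table's numbers hardcoded):

```python
# Recompute the F statistic from the printed ANOVA table:
# F = ((RSS_small - RSS_large) / df_diff) / (RSS_large / df_large)
rss_small, df_small = 76.677, 59   # I(log(time)) ~ 1
rss_large, df_large = 23.567, 57   # I(log(time)) ~ group
df_diff = df_small - df_large      # 2 parameters removed

F = ((rss_small - rss_large) / df_diff) / (rss_large / df_large)
print(round(F, 3))  # 64.227, matching the table
```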

5.

We may as requested report the parameters of the model in subquestion 3 (with control intercept) as

summary(lm_logLT)
## 
## Call:
## lm(formula = I(log(time)) ~ group, data = pillbug)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5454 -0.2037  0.1510  0.4448  0.8602 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.99207    0.14378  34.720  < 2e-16 ***
## groupLight    -2.00211    0.20334  -9.846 6.61e-14 ***
## groupMoisture -0.01266    0.20334  -0.062    0.951    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.643 on 57 degrees of freedom
## Multiple R-squared:  0.6926, Adjusted R-squared:  0.6819 
## F-statistic: 64.23 on 2 and 57 DF,  p-value: 2.498e-15

Note especially from this that the estimate for groupMoisture is roughly zero, such that the data doesn’t change much with moisture. Note also that the reported estimate for the change of the light-exposure group compared to the control is a change in log-time of roughly \(-2\) - something that makes a lot of sense when looking at the boxplots of control and light in subproblem 3. Finally, the anova conclusion reached in subquestion 4 is present at the bottom of the summary, in the sense that the F statistic, along with the degrees of freedom and p-value, are reported there.

6.

confint(lm_logLT)
##                    2.5 %     97.5 %
## (Intercept)    4.7041572  5.2799872
## groupLight    -2.4092865 -1.5949399
## groupMoisture -0.4198323  0.3945143

We see from the above that \(0\) is not contained in the \(95\%\) confidence interval for groupLight, which could thus be said to be significant in this sense. The opposite holds for groupMoisture, for which zero is contained in the interval.
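Since the model is on the log scale, the groupLight estimate and interval can be translated into multiplicative effects on the original time scale by exponentiating; a Python sketch using the numbers printed above:

```python
import math

# Exponentiate the log-scale estimate and CI endpoints for groupLight
# (numbers hardcoded from the summary/confint output above).
est = math.exp(-2.00211)     # point estimate: light times are roughly 0.135 of control
lo = math.exp(-2.4092865)    # lower CI endpoint, roughly 0.09
hi = math.exp(-1.5949399)    # upper CI endpoint, roughly 0.20
print(est, lo, hi)
```

So, on the original scale, light exposure plausibly cuts running times to somewhere between roughly a tenth and a fifth of the control times.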

HS 17

1.

2.

3.

4.

5.

HS 18

1.

2.

3.


Extra Problems

Extra 5

Extra 6

Extra 8

Extra on Regression and confidence intervals

NRH RMD (In Danish)

This document walks through the analysis from EH example 11.13 in R. The document shows some R-technical solutions; the comments on the analysis itself are very sparse and should not be taken as representative of what a practical data analysis should look like.

We will use various functions from the tidyverse, so the package is loaded first.

library(tidyverse)

We load the data from EH example 11.13.

CAPM <- read_csv("CAPM.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Month = col_character(),
##   r = col_double(),
##   M = col_double(),
##   R = col_double()
## )

Scatter plots of the three variables against each other.

qplot(r, R, data = CAPM, size = I(3), color = I("blue"))
qplot(M, R, data = CAPM, size = I(3), color = I("blue"))
qplot(r, M, data = CAPM, size = I(3), color = I("blue"))

We can also plot the three variables over time, which makes good sense for this data set.

qplot(factor(Month, levels = Month), R, data = CAPM, size = I(2), color = I("blue")) + geom_line(aes(group = 1), color = "blue")
qplot(factor(Month, levels = Month), M, data = CAPM, size = I(2), color = I("red")) + geom_line(aes(group = 1), color = "red")
qplot(factor(Month, levels = Month), r, data = CAPM, size = I(2), color = I("violet")) + geom_line(aes(group = 1), color = "violet")

We now recover the fitted model from EH example 11.13.

CAPM_lm <- lm(R ~ M + r, data = CAPM)
summary(CAPM_lm)
## 
## Call:
## lm(formula = R ~ M + r, data = CAPM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7829  -2.6282  -0.4654   3.4771  11.5701 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -10.0837     6.1645  -1.636   0.1203  
## M             0.1556     0.3371   0.462   0.6502  
## r            29.0208    14.4096   2.014   0.0601 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.394 on 17 degrees of freedom
## Multiple R-squared:  0.2035, Adjusted R-squared:  0.1098 
## F-statistic: 2.171 on 2 and 17 DF,  p-value: 0.1446
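The \(F\) statistic at the bottom of the summary is determined by \(R^2\) and the degrees of freedom via \(F=\frac{R^2/k}{(1-R^2)/\mathrm{df}_{\mathrm{res}}}\); a small Python cross-check of the printed numbers:

```python
# F statistic from R^2: F = (R^2 / k) / ((1 - R^2) / df_res),
# with k = 2 regressors (M and r) and df_res = 17
r2, k, df_res = 0.2035, 2, 17
F = (r2 / k) / ((1 - r2) / df_res)
print(round(F, 2))  # 2.17, matching the reported F-statistic of 2.171
```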

We can construct the design matrix “by hand” from the data in the following way:

A <- matrix(c(rep(1, 20), CAPM$M, CAPM$r), nrow = 20)
A
##       [,1]   [,2]  [,3]
##  [1,]    1  8.549 0.496
##  [2,]    1 -0.173 0.510
##  [3,]    1  4.199 0.532
##  [4,]    1 -0.630 0.569
##  [5,]    1 -0.565 0.543
##  [6,]    1  5.106 0.581
##  [7,]    1  7.012 0.543
##  [8,]    1 -0.346 0.475
##  [9,]    1  6.312 0.370
## [10,]    1  2.949 0.316
## [11,]    1 -1.556 0.340
## [12,]    1  2.432 0.316
## [13,]    1  8.631 0.309
## [14,]    1  0.123 0.315
## [15,]    1 10.174 0.314
## [16,]    1 -4.737 0.350
## [17,]    1  2.498 0.334
## [18,]    1 -5.460 0.346
## [19,]    1  0.921 0.351
## [20,]    1  5.283 0.371

We can now recover the estimates by solving a linear system of equations.

solve(t(A) %*% A, t(A) %*% CAPM$R)
##             [,1]
## [1,] -10.0837178
## [2,]   0.1555781
## [3,]  29.0207852

We can also extract the design matrix from the lm object, or construct it directly with the function model.matrix()

model.matrix(CAPM_lm)
##    (Intercept)      M     r
## 1            1  8.549 0.496
## 2            1 -0.173 0.510
## 3            1  4.199 0.532
## 4            1 -0.630 0.569
## 5            1 -0.565 0.543
## 6            1  5.106 0.581
## 7            1  7.012 0.543
## 8            1 -0.346 0.475
## 9            1  6.312 0.370
## 10           1  2.949 0.316
## 11           1 -1.556 0.340
## 12           1  2.432 0.316
## 13           1  8.631 0.309
## 14           1  0.123 0.315
## 15           1 10.174 0.314
## 16           1 -4.737 0.350
## 17           1  2.498 0.334
## 18           1 -5.460 0.346
## 19           1  0.921 0.351
## 20           1  5.283 0.371
## attr(,"assign")
## [1] 0 1 2
model.matrix(~ M + r, data = CAPM)
##    (Intercept)      M     r
## 1            1  8.549 0.496
## 2            1 -0.173 0.510
## 3            1  4.199 0.532
## 4            1 -0.630 0.569
## 5            1 -0.565 0.543
## 6            1  5.106 0.581
## 7            1  7.012 0.543
## 8            1 -0.346 0.475
## 9            1  6.312 0.370
## 10           1  2.949 0.316
## 11           1 -1.556 0.340
## 12           1  2.432 0.316
## 13           1  8.631 0.309
## 14           1  0.123 0.315
## 15           1 10.174 0.314
## 16           1 -4.737 0.350
## 17           1  2.498 0.334
## 18           1 -5.460 0.346
## 19           1  0.921 0.351
## 20           1  5.283 0.371
## attr(,"assign")
## [1] 0 1 2

First we plot the raw residuals

qplot(fitted(CAPM_lm), residuals(CAPM_lm)) + 
    geom_hline(yintercept = 0) +
    geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

and then the residuals standardized by the central estimate of the standard deviation.

sigma_hat <- sqrt(sum(residuals(CAPM_lm)^2) / 17)  # Central estimate of the standard deviation
qplot(fitted(CAPM_lm), residuals(CAPM_lm) / sigma_hat) + 
    geom_hline(yintercept = 0) +
    geom_hline(yintercept = -2, linetype = 2) +
    geom_hline(yintercept = 2, linetype = 2) +
    geom_smooth() +
    ylim(-3, 3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Finally, we plot the standardized residuals (red below), where one has thus also divided by \(\sqrt{1 - h_{ii}}\), compared with the residuals above.

qplot(fitted(CAPM_lm), rstandard(CAPM_lm), color = I("red")) + 
    geom_point(aes(y = residuals(CAPM_lm) / sigma_hat)) + 
    geom_hline(yintercept = 0) +
    geom_hline(yintercept = -2, linetype = 2) +
    geom_hline(yintercept = 2, linetype = 2) +
    geom_smooth() +
    ylim(-3, 3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

One can observe that the effect of dividing by \(\sqrt{1 - h_{ii}}\) is quite small, but all observations move outward. We can extract \(h_{ii}\) from the lm object with hatvalues(), if we want.

range(1 / sqrt(1 - hatvalues(CAPM_lm)))
## [1] 1.040627 1.170057

For this example it is also worth looking at the residuals as a function of time, and making a lag plot to see whether there is any dependence.

qplot(1:20, rstandard(CAPM_lm)) + 
    geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
qplot(rstandard(CAPM_lm)[-20], rstandard(CAPM_lm)[-1]) + 
    geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

The lag plot indicates that the errors are negatively correlated.

We can also make a QQ-plot to examine the normality assumption. With only 20 observations it is quite hard to reject the normal distribution.

qqnorm(rstandard(CAPM_lm))
qqline(rstandard(CAPM_lm))

Confidence intervals

Finally, we can recover the confidence intervals from EH by using confint() on the lm object. It converts estimates and standard errors into standard confidence intervals based on the correct \(t\)-distribution.

confint(CAPM_lm)
##                   2.5 %     97.5 %
## (Intercept) -23.0895931  2.9221576
## M            -0.5555732  0.8667294
## r            -1.3808162 59.4223867
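confint() here amounts to estimate ± \(t\)-quantile × standard error with 17 residual degrees of freedom; reproducing the M row by hand (a Python sketch, where the quantile \(t_{17,0.975}\approx2.1098\) is hardcoded rather than computed):

```python
# Reproduce confint(CAPM_lm) for the M coefficient: estimate +/- t * SE
est, se = 0.1556, 0.3371    # from summary(CAPM_lm)
t_975_17 = 2.1098           # 0.975 quantile of the t-distribution with 17 df
lo = est - t_975_17 * se
hi = est + t_975_17 * se
print(lo, hi)               # close to the reported (-0.5556, 0.8667)
```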

If we reduce the model by removing the intercept, we get a rather different estimate and a reduction in the standard error of \(\hat{\gamma}\). This is mainly because \(r\) varies very little, so the intercept column and \(r\) end up being close to each other. Whether they are outright close to collinear is up for discussion.

CAPM_lm_0 <- lm(R ~ M + r - 1, data = CAPM)
summary(CAPM_lm_0)
## 
## Call:
## lm(formula = R ~ M + r - 1, data = CAPM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1457  -3.1795  -0.7022   3.3169   9.2088 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## M  0.09853    0.35052   0.281    0.782
## r  6.32644    4.07043   1.554    0.138
## 
## Residual standard error: 6.685 on 18 degrees of freedom
## Multiple R-squared:  0.1799, Adjusted R-squared:  0.08875 
## F-statistic: 1.974 on 2 and 18 DF,  p-value: 0.1678
confint(CAPM_lm_0)
##        2.5 %     97.5 %
## M -0.6378792  0.8349428
## r -2.2252161 14.8780969
CAPM <- read_csv("CAPM.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Month = col_character(),
##   r = col_double(),
##   M = col_double(),
##   R = col_double()
## )

1.

Our main focus is the CAPM-inspired regression model from EH example 11.13, \[ R_i\equiv \alpha + \beta M_i + \gamma r_i + \epsilon_i, \] for \(\epsilon_1,\ldots,\epsilon_N\) iid. \(\sim\mathcal{N}\mathopen{}\left({{0},{\sigma^2}}\right)\mathclose{}\) for some \(\sigma>0,\) such that \(R_1,\ldots,R_N\) are assumed independent normally distributed with variance homogeneity of \(\sigma^2,\) and mean \[ ER_i=\alpha+\beta M_i+\gamma r_i. \] Define \(\theta\equiv\mathopen{}\left({\alpha, \beta, \gamma}\right)\mathclose{}^T\) such that we may find a designmatrix \(A\) with \[ R= A\theta+\epsilon\equiv A\begin{pmatrix}{\alpha\\\beta\\\gamma}\end{pmatrix}+\begin{pmatrix}{\epsilon_1\\\,\vdots\\\epsilon_N}\end{pmatrix}\equiv A_{,1}\alpha + A_{,2}\beta + A_{,3}\gamma + \begin{pmatrix}{\epsilon_1\\\,\vdots\\\epsilon_N}\end{pmatrix}. \] Note as \(\epsilon\in M_{N,1},\theta\in M_{3,1}\) that we thus need to have \(A\in M_{N,3}.\) Further, we will necessarily need to have \[ a_{j,1}=1,\quad a_{j,2}=M_j,\quad a_{j,3}=r_j,\quad \text{for } j=1,\ldots,N. \] Consequently, our central assumption of \(R_1,\ldots,R_N\) independent normally distributed with variance homogeneity may thus for \(R:=\mathopen{}\left({R_1,\ldots,R_N}\right)\mathclose{}^T\) be written as \[ R\sim\mathcal{N}\mathopen{}\left({{\begin{pmatrix}{ER_1\\ \,\,\vdots\\ ER_N}\end{pmatrix}},{\sigma^2I}}\right)\mathclose{}\equiv\mathcal{N}\mathopen{}\left({{\begin{pmatrix}{\alpha+\beta M_1+\gamma r_1\\ \quad \quad \quad \vdots\\\alpha+\beta M_N+\gamma r_N}\end{pmatrix}},{\sigma^2I}}\right)\mathclose{}\equiv\mathcal{N}\mathopen{}\left({{A\theta},{\sigma^2I}}\right)\mathclose{}. \] We may define the designmatrix in R as

N <- nrow(CAPM)
k <- 3
R <- CAPM$R
A <- matrix(c(rep(1,20),CAPM$M,CAPM$r), nrow = N, ncol = k) #matrix() fills by column by default (byrow = FALSE)
A
##       [,1]   [,2]  [,3]
##  [1,]    1  8.549 0.496
##  [2,]    1 -0.173 0.510
##  [3,]    1  4.199 0.532
##  [4,]    1 -0.630 0.569
##  [5,]    1 -0.565 0.543
##  [6,]    1  5.106 0.581
##  [7,]    1  7.012 0.543
##  [8,]    1 -0.346 0.475
##  [9,]    1  6.312 0.370
## [10,]    1  2.949 0.316
## [11,]    1 -1.556 0.340
## [12,]    1  2.432 0.316
## [13,]    1  8.631 0.309
## [14,]    1  0.123 0.315
## [15,]    1 10.174 0.314
## [16,]    1 -4.737 0.350
## [17,]    1  2.498 0.334
## [18,]    1 -5.460 0.346
## [19,]    1  0.921 0.351
## [20,]    1  5.283 0.371
Alternative definition of A in R

As \(A\) is very much defined via its columns, we may use the cbind function to construct our matrix;

A <- cbind(rep(1,20),CAPM$M, CAPM$r)

We may then use the results of K10.21 of EH to conclude that the MLE of \(\theta\) may be found in R as

MLETheta <- solve(t(A)%*%A)%*%t(A)%*%CAPM$R

Also by EH K10.21 we have \(\hat\theta\sim\mathcal{N}\mathopen{}\left({{\theta},{\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}}}\right)\mathclose{},\) such that the variance matrix of \(\hat\theta\) is \(\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1},\) which we may estimate in R through

sigmat2 <- as.numeric((t(R)%*%R-t(MLETheta)%*%t(A)%*%A%*%MLETheta)/(N-k)) 
## [1] 40.88448
thetaCovMat <- sigmat2*solve(t(A)%*%A)
##             [,1]       [,2]        [,3]
## [1,]  38.0005763 -0.2149795 -85.5238319
## [2,]  -0.2149795  0.1136148  -0.1766915
## [3,] -85.5238319 -0.1766915 207.6366120

Note that defining \(\tilde\sigma^2=\)sigmat2 with as.numeric is rather important: it converts the \(1\times1\) matrix produced by the matrix products into a scalar, so that the subsequent multiplication defining thetaCovMat does not fail with non-conformable arguments.
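As a consistency check, sigmat2 should equal the squared residual standard error reported by summary(CAPM_lm) (6.394 on 17 degrees of freedom), since both are RSS divided by \(N-k\); a quick Python check of the printed numbers:

```python
# sigmat2 should equal the squared residual standard error from summary()
rse = 6.394          # residual standard error reported by summary(CAPM_lm)
sigmat2 = 40.88448   # central variance estimate computed above
print(rse**2)        # ~40.88, agreeing with sigmat2 up to rounding
```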

2.

With the above calculated distribution of \(\hat\theta\equiv\mathopen{}\left({\hat\alpha,\,\hat\beta,\,\hat\gamma}\right)\mathclose{}^T\) we may by EH L9.47 conclude that \[ \hat\beta\sim\mathcal{N}\mathopen{}\left({{\beta},{\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}_{2,2}}}\right)\mathclose{}\overset{\cdot}{=}\mathcal{N}\mathopen{}\left({{\beta},{0.1136148}}\right)\mathclose{},\\ \hat\gamma\sim\mathcal{N}\mathopen{}\left({{\gamma},{\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}_{3,3}}}\right)\mathclose{}\overset{\cdot}{=}\mathcal{N}\mathopen{}\left({{\gamma},{207.636612}}\right)\mathclose{}. \] But more importantly \(\hat\beta+\hat\gamma=\mathopen{}\left({0,1,1}\right)\mathclose{}\hat\theta,\) such that we by K9.46 of EH may conclude that \[ \hat\beta+\hat\gamma=\mathopen{}\left({0,1,1}\right)\mathclose{}\hat\theta\sim\mathcal{N}\mathopen{}\left({{\mathopen{}\left({0,1,1}\right)\mathclose{}\theta},{\,\mathopen{}\left({0,1,1}\right)\mathclose{}\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}\begin{pmatrix}{0\\1\\1}\end{pmatrix}}}\right)\mathclose{}\\ =\mathcal{N}\mathopen{}\left({{\beta+\gamma},{\,\mathopen{}\left({0,1,1}\right)\mathclose{}\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}\begin{pmatrix}{0\\1\\1}\end{pmatrix}}}\right)\mathclose{} \] with the above variance calculated in R as

selector <- matrix(c(0,1,1),nrow=1)
bgvar <- as.numeric(selector%*%thetaCovMat%*%t(selector))
## [1] 207.3968

As such, if we wanted to write the distribution of \(\hat\beta+\hat\gamma\) on the form \(\hat\beta+\hat\gamma\sim\mathcal{N}\mathopen{}\left({{\beta+\gamma},{se^2}}\right)\mathclose{}\) we should choose \(\tilde{se}^2=\)bgvar\(=207.3968439,\) such that \(\tilde{se}=14.4012792.\)

Having calculated the central estimator for \(\sigma^2,\,\tilde\sigma^2=40.8844839,\) and writing \(se^2=\sigma^2\cdot a\approx\tilde\sigma^2\cdot a,\) we may find \(a\) as \[ a=\frac{se^2}{\tilde\sigma^2}\overset{\cdot}{=}\frac{207.3968439}{40.8844839}=5.0727519 \]
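The quadratic form computed by selector %*% thetaCovMat %*% t(selector) unpacks to \(\widehat{\mathrm{Var}}(\hat\beta)+\widehat{\mathrm{Var}}(\hat\gamma)+2\widehat{\mathrm{Cov}}(\hat\beta,\hat\gamma)\); a Python cross-check using the entries printed in thetaCovMat above:

```python
# Var(beta + gamma) = Var(beta) + Var(gamma) + 2 Cov(beta, gamma),
# with the estimated entries read off the printed thetaCovMat
var_b, var_g, cov_bg = 0.1136148, 207.6366120, -0.1766915
bgvar = var_b + var_g + 2 * cov_bg
print(round(bgvar, 4))   # 207.3968, matching the value of bgvar above

sigmat2 = 40.8844839     # central variance estimate from subquestion 1
a = bgvar / sigmat2      # the scaling factor in se^2 = sigma^2 * a
print(round(a, 4))       # ~5.0728, matching the reported a
```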

3.

4.


EH problems

EH 9.4

EH 9.7

EH 9.12

EH 9.13

EH 9.20

EH 10.5

Consider \(X_1,X_2,X_3,X_4,X_5\) independent normally distributed, each with the same variance \(\sigma^2>0\), and individual expectations given by \[ EX_1=\alpha_1,\quad EX_2=\alpha_1-\alpha_2,\quad EX_3 = \alpha_1-\alpha_2,\quad EX_4 = \alpha_1+\alpha_2,\quad EX_5 = \alpha_1+\alpha_2, \] for unknown parameters \(\alpha_1,\alpha_2\in\mathbb{R}\).

a)

Assume we have observed outcomes of the \(X\)’s such that \(X_i=x_i\) with \(x_i\) as assigned below

X <- c(-0.187,-1.731,-0.184,2.252,1.775)

Note that \(X\equiv\left({X_1,\ldots,X_5}\right)^T\) is a bundling of five random variables, perceived as a column vector as such; since \(\alpha\equiv\left({\alpha_1,\alpha_2}\right)^T\in M_{2,1}(\mathbb{R})\) and \(A\alpha\) is to be our expected value vector (\(\in M_{5,1}(\mathbb{R})\)), we will need \(A\in M_{5,2}(\mathbb{R}).\)

Consequently, observe \[ A\alpha \equiv \begin{pmatrix} a_{1,1} & a_{1,2}\\ a_{2,1} & a_{2,2}\\ a_{3,1} & a_{3,2}\\ a_{4,1} & a_{4,2}\\ a_{5,1} & a_{5,2} \end{pmatrix}\begin{pmatrix}{\alpha_1\\\alpha_2}\end{pmatrix}\equiv\begin{pmatrix}{a_{1,1}\\a_{2,1}\\a_{3,1}\\a_{4,1}\\a_{5,1}}\end{pmatrix}\alpha_1+\begin{pmatrix}{a_{1,2}\\a_{2,2}\\a_{3,2}\\a_{4,2}\\a_{5,2}}\end{pmatrix}\alpha_2, \] such that as we are seeking \(\begin{pmatrix}{\alpha_1\\\alpha_1-\alpha_2\\\alpha_1-\alpha_2\\\alpha_1+\alpha_2\\\alpha_1+\alpha_2}\end{pmatrix}\equiv EX=A\alpha,\) we may read off the coefficient columns of \(\alpha_1\) and \(\alpha_2\) as \(a_{i,1}=1,\,i=1,\ldots,5,\) and \(a_{1,2}=0,\,a_{2,2}=-1,\,a_{3,2}=-1,\,a_{4,2}=1,\,a_{5,2}=1,\) such that \[ A:=\begin{pmatrix} 1&0\\ 1&-1\\ 1&-1\\ 1&1\\ 1&1 \end{pmatrix} \] will be our designated designmatrix.

In accordance with K10.21 in EH we may define our designmatrix in R as

A <- matrix(c(1,0,1,-1,1,-1,1,1,1,1), nrow = 5, ncol = 2, byrow = T) #Fill-by-column is the default; the entries are listed row by row here, hence byrow = T

By EH K10.21 we will have \(\hat\alpha = \left({A^TA}\right)^{-1}A^TX\) and as such we may define in R

puA <- solve(t(A)%*%A)%*%t(A)
alphahat <- puA%*%X
##        [,1]
## [1,] 0.3850
## [2,] 1.4855

and as such we may therefore conclude that the MLE for \(\alpha\), \(\hat\alpha\equiv\begin{pmatrix}{\hat\alpha_1\\\hat\alpha_2}\end{pmatrix}=\begin{pmatrix}{0.385\\1.4855}\end{pmatrix}.\)

b)

By page 413 in EH, for \(A\) being a \(N\times k\overset{\cdot}{=}5\times 2\) matrix, we will by page 415 have that we may estimate \(\sigma^2\) on the basis of data as \[ \tilde{\sigma}^2=\frac{X^TX-\hat{\alpha}^TA^TA\hat{\alpha}}{N-k}\overset{\cdot}{=}\frac{X^TX-\hat{\alpha}^TA^TA\hat{\alpha}}{3}, \] which we may calculate in R with

sigmatilde2 <- as.numeric((t(X)%*%X-t(alphahat)%*%t(A)%*%A%*%alphahat)/3) #as.numeric converts the 1x1 matrix to a scalar for later use
## [1] 0.5731163
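Because \(A^TA=\operatorname{diag}(5,4)\) here, the normal equations decouple, and both \(\hat\alpha\) and \(\tilde\sigma^2\) can be verified by elementary arithmetic; a language-neutral sketch in Python with the data retyped:

```python
# Hand computation for EH 10.5: A^T A = diag(5, 4), so the normal
# equations (A^T A) alpha = A^T x decouple into two scalar equations.
x = [-0.187, -1.731, -0.184, 2.252, 1.775]
a2 = [0, -1, -1, 1, 1]                                # second column of A

alpha1 = sum(x) / 5                                   # (A^T x)_1 / 5
alpha2 = sum(ai * xi for ai, xi in zip(a2, x)) / 4    # (A^T x)_2 / 4
rss = sum(xi**2 for xi in x) - (5 * alpha1**2 + 4 * alpha2**2)
sigmatilde2 = rss / (5 - 2)                           # N - k = 3
print(alpha1, alpha2, sigmatilde2)  # agrees with alphahat and sigmatilde2 above
```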

c)

By EH K10.21 we have that the MLE for the expected value, \(\hat\alpha,\) and the MLE for the variance, \(\hat\sigma^2,\) are independent. As the centralized estimator for the variance, \(\tilde{\sigma}^2:=N\frac{\hat\sigma^2}{N-k},\) is a measurable transformation of \(\hat\sigma^2,\) we also have independence of \(\hat\alpha\) and \(\tilde\sigma^2.\)

By independence we get that the joint density of \(\hat\alpha\) and \(\tilde\sigma^2\) can be obtained as the product of the marginal densities. Note in this regard that by K10.21

\[ \hat\alpha\sim\mathcal{N}(\alpha,\sigma^2\left({A^TA}\right)^{-1}) \] and \[ \hat\sigma^2\sim\frac{\sigma^2}{N}\chi^2_{N-k}, \] meaning that \(\hat\sigma^2\) is \(\chi^2\) distributed with \(N-k\) degrees of freedom and “scale-parameter” \(\frac{\sigma^2}{N},\) to be interpreted in the sense that \(\frac{1}{\frac{\sigma^2}{N}}\hat\sigma^2\equiv\frac{N}{\sigma^2}\hat\sigma^2\sim\chi^2_{N-k}\) is “properly” \(\chi^2\) distributed with \(N-k\) degrees of freedom. Consequently, as \(\tilde{\sigma}^2:=N\frac{\hat\sigma^2}{N-k},\) we will therefore have \[ \tilde\sigma^2\sim N\frac{\frac{\sigma^2}{N}}{N-k}\chi^2_{N-k}\equiv\frac{\sigma^2}{N-k}\chi^2_{N-k}. \] The density of \(Q\sim\chi_q^2\) is \(f_Q(x)=\frac{1}{2^{\frac{q}{2}}\Gamma(\frac{q}{2})}x^{\frac{q}{2}-1}e^{-\frac{x}{2}},\) so by the scaling transformation \(x\mapsto\frac{N-k}{\sigma^2}x\) the density of \(\tilde\sigma^2\) becomes \[\begin{align*} f_{\tilde\sigma^2}(x)&=\frac{N-k}{\sigma^2}f_{\chi^2_{N-k}}\mathopen{}\left({\frac{(N-k)x}{\sigma^2}}\right)\mathclose{}=\frac{N-k}{\sigma^2}\frac{1}{2^{\frac{N-k}{2}}\Gamma(\frac{N-k}{2})}\mathopen{}\left({\frac{(N-k)x}{\sigma^2}}\right)\mathclose{}^{\frac{N-k}{2}-1}e^{-\frac{(N-k)x}{2\sigma^2}}\\ &\overset{\cdot}{=}\frac{3}{\sigma^2}\frac{1}{2^{\frac{3}{2}}\Gamma(\frac{3}{2})}\mathopen{}\left({\frac{3x}{\sigma^2}}\right)\mathclose{}^{\frac{1}{2}}e^{-\frac{3x}{2\sigma^2}}, \end{align*}\] hence, for \(f_{\hat\alpha}\) being the density of \(\hat\alpha,\) the joint density of \(\hat\alpha\) and \(\tilde\sigma^2\) will be \[ f_{\hat\alpha,\tilde\sigma^2}(x,y)=f_{\hat\alpha}(x)\cdot f_{\tilde\sigma^2}(y). \]

d)

We may deploy one of two methods, in order to calculate confidence intervals for \(\alpha_1,\alpha_2.\)

Method 1: EH Example 10.30

Note that we by a) may write \[ X\equiv\left({X_1,\ldots,X_5}\right)^T\sim\mathcal{N}\left({{A\alpha},{\sigma^2I}}\right). \] Define \(\varphi_1:\mathbb{R}^2\rightarrow\mathbb{R},\,\varphi_2:\mathbb{R}^2\rightarrow\mathbb{R},\) for \(\omega\equiv\begin{pmatrix}{\omega_1\\\omega_2}\end{pmatrix}\in\mathbb{R}^2\) as the coordinate projections \(\varphi_1(\omega):=\omega_1,\,\varphi_2(\omega):=\omega_2\) respectively. For \(e_1:=\begin{pmatrix}{1\\0}\end{pmatrix},\,e_2:=\begin{pmatrix}{0\\1}\end{pmatrix},\) observe \[ \varphi_1(\omega)=e_1^T\omega=\omega_1,\quad\varphi_2(\omega)=e_2^T\omega=\omega_2. \] \(1-\alpha^*\) confidence intervals for \(\varphi_1(\alpha)=\alpha_1,\,\varphi_2(\alpha)=\alpha_2\) may be obtained as \[\begin{align} KI_{\alpha_1}&=\left({e_1^T\hat\alpha-\sqrt{e_1^T\left({A^TA}\right)^{-1}e_1f_{\alpha^*}\tilde\sigma^2},\,e_1^T\hat\alpha+\sqrt{e_1^T\left({A^TA}\right)^{-1}e_1f_{\alpha^*}\tilde\sigma^2}}\right)\\ KI_{\alpha_2}&=\left({e_2^T\hat\alpha-\sqrt{e_2^T\left({A^TA}\right)^{-1}e_2f_{\alpha^*}\tilde\sigma^2},\,e_2^T\hat\alpha+\sqrt{e_2^T\left({A^TA}\right)^{-1}e_2f_{\alpha^*}\tilde\sigma^2}}\right) \end{align}\] by EH Example 10.30, with \(f_{\alpha^*}\) being the \(1-\alpha^*\) quantile for the \(F\)-distribution with \(\left({1,N-k}\right)\) degrees of freedom, for \(A\in M_{N,k}\overset{\cdot}{=}M_{5,2}\) of rank \(k\overset{\cdot}{=}2\).

Having looked through ?qf, an implementation in R for \(\alpha^*=0.05\) may be obtained with

falphastar <- qf(p = 0.95, df1 = 1, df2 = nrow(A)-ncol(A)) #f_{\alpha^*}
e_1 <- c(1,0)
e_2 <- c(0,1)
sqhelper_1 <- as.numeric(sqrt(t(e_1)%*%solve(t(A)%*%A)%*%e_1*falphastar*sigmatilde2))
## [1] 1.077451
sqhelper_2 <- as.numeric(sqrt(t(e_2)%*%solve(t(A)%*%A)%*%e_2*falphastar*sigmatilde2))
## [1] 1.204627
KI_alpha1_1 <- c(e_1%*%alphahat-sqhelper_1, e_1%*%alphahat+sqhelper_1)
## [1] -0.6924509  1.4624509
KI_alpha2_1 <- c(e_2%*%alphahat-sqhelper_2, e_2%*%alphahat+sqhelper_2)
## [1] 0.2808733 2.6901267
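Since \(\sqrt{F_{0.95;1,3}}\) equals the \(t\)-quantile \(t_{3,0.975}\approx3.1824\), the interval for \(\alpha_1\) can equivalently be reproduced as a plain \(t\)-interval; a Python sketch with the quantile hardcoded:

```python
import math

# Reproduce KI_alpha1_1 via a t-interval, using sqrt(F_{0.95;1,3}) = t_{3,0.975}
t_975_3 = 3.182446           # 0.975 quantile of the t-distribution with 3 df
alpha1_hat = 0.385
sigmatilde2 = 0.5731163
half_width = t_975_3 * math.sqrt(sigmatilde2 / 5)   # (A^T A)^{-1}_{1,1} = 1/5
lo = alpha1_hat - half_width
hi = alpha1_hat + half_width
print(lo, hi)                # close to the reported (-0.6924509, 1.4624509)
```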

Other methods to define the sqhelper’s

Note that we could’ve also defined the sqhelper’s with

sqrt(solve(t(A)%*%A)[1,1]*falphastar*sigmatilde2)
## [1] 1.077451
sqrt(solve(t(A)%*%A)[2,2]*falphastar*sigmatilde2)
## [1] 1.204627

as \(e_i^TBe_i=b_{ii}\) for any standard basis vector \(e_i\) of a size appropriate for multiplication with some matrix \(B.\) Finally, we might also choose to forego the selection of the diagonal element until it is needed;

sqhelper <- sqrt(solve(t(A)%*%A)*falphastar*sigmatilde2)
sqhelper[1,1]
## [1] 1.077451
sqhelper[2,2]
## [1] 1.204627
Method 2: EH K10.21

As \[ \hat\alpha\sim\mathcal{N}(\alpha,\sigma^2\left({A^TA}\right)^{-1})\overset{\cdot}{=}\mathcal{N}\left({{\begin{pmatrix}{\alpha_1\\\alpha_2}\end{pmatrix}},{\begin{pmatrix} \frac{\sigma^2}{5}&0\\ 0&\frac{\sigma^2}{4}\end{pmatrix}}}\right) \] we get by EH L9.47 that \[ \hat\alpha_1\sim\mathcal{N}\left({{\alpha_1},{\frac{\sigma^2}{5}}}\right),\quad\hat\alpha_2\sim\mathcal{N}\left({{\alpha_2},{\frac{\sigma^2}{4}}}\right). \] By the likes of “Introduktion til Statistik” T4.7 we may note that \[ \hat\alpha_1\pm t_{N-k,1-\frac{\alpha^*}{2}}SE(\hat\alpha_1),\quad\hat\alpha_2\pm t_{N-k,1-\frac{\alpha^*}{2}}SE(\hat\alpha_2), \] will be \(1-\alpha^*\) confidence intervals for \(\alpha_1,\alpha_2\) respectively, where the standard errors are based on \(\tilde\sigma^2\) and thus carry \(N-k=5-2=3\) degrees of freedom. We are seeking \(95\%\)-confidence intervals, such that \(\alpha^*=0.05.\)

talphastar <- qt(0.975,3) #3.182446; df = N-k = 3, matching sqrt(qf(0.95,1,3)) of Method 1
KI_alpha1_2 <- c(alphahat[1]-talphastar*sqrt(sigmatilde2*solve(t(A)%*%A)[1,1]), alphahat[1]+talphastar*sqrt(sigmatilde2*solve(t(A)%*%A)[1,1]))
## [1] -0.6924509  1.4624509
KI_alpha2_2 <- c(alphahat[2]-talphastar*sqrt(sigmatilde2*solve(t(A)%*%A)[2,2]), alphahat[2]+talphastar*sqrt(sigmatilde2*solve(t(A)%*%A)[2,2]))
## [1] 0.2808733 2.6901267

As \(\sqrt{F_{0.95;1,3}}=t_{3,0.975},\) the two methods produce identical intervals.

EH 11.3

We can import the data with

EHopg11dot3 <- read_table2("EHopg11dot3.txt", 
    col_types = cols(Alder = col_integer()))
## Warning: 19 parsing failures.
## row col  expected    actual              file
##   2  -- 3 columns 4 columns 'EHopg11dot3.txt'
##   3  -- 3 columns 4 columns 'EHopg11dot3.txt'
##   4  -- 3 columns 4 columns 'EHopg11dot3.txt'
##   5  -- 3 columns 4 columns 'EHopg11dot3.txt'
##   6  -- 3 columns 4 columns 'EHopg11dot3.txt'
## ... ... ......... ......... .................
## See problems(...) for more details.

a)

We may subset the data according to profession with

jour <- subset(EHopg11dot3, EHopg11dot3$Fag == "J")
uni <-  subset(EHopg11dot3, EHopg11dot3$Fag == "U")

We may then plot bloodpressure across from age in each of the two groups;

plot(jour$Alder, jour$Blodtryk, ylab = "Bloodpressure of journalists (mm Hg)", xlab = "Years of age")
plot(uni$Alder, uni$Blodtryk, ylab = "Bloodpressure of University teachers (mm Hg)", xlab = "Years of age")

Looking at the plots, and bearing in mind the low number of data points, it does not look unreasonable to assume that there might be a linear relation between blood pressure and age.

b)

We may steal the statistical model as described in IS D6.1, recited as

We assume \(Y_1,\ldots,Y_n\) independent with \(Y_i\sim\mathcal{N}(\alpha+\beta x_i,\sigma^2),\) where \(\alpha,\beta\in\mathbb{R},\,\sigma^2>0\) are unknown parameters.

for each of the two groupings. Implementing this in R and peering at the summary is done through

jourmodel <- lm(jour$Blodtryk ~ jour$Alder)
summary(jourmodel)
## 
## Call:
## lm(formula = jour$Blodtryk ~ jour$Alder)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.644 -2.881  1.274  1.849  7.219 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  84.9952     7.5966   11.19 2.38e-07 ***
## jour$Alder    1.5296     0.1325   11.54 1.74e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.26 on 11 degrees of freedom
## Multiple R-squared:  0.9237, Adjusted R-squared:  0.9168 
## F-statistic: 133.2 on 1 and 11 DF,  p-value: 1.737e-07

just as well as we might draw estimates for \(\alpha_{J},\,\beta_J\) out of R with

alphaj <- jourmodel$coefficients[[1]]
## [1] 84.99522
betaj <- jourmodel$coefficients[[2]]
## [1] 1.529568
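The t values reported in the summary are simply each estimate divided by its standard error; a quick Python cross-check of the journalist fit using the printed numbers:

```python
# t value = Estimate / Std. Error for both rows of summary(jourmodel)
t_intercept = 84.9952 / 7.5966
t_slope = 1.5296 / 0.1325
print(round(t_intercept, 2), round(t_slope, 2))  # 11.19 and 11.54, as reported
```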

c)

We copy the procedure of subproblem b;

unimodel <- lm(uni$Blodtryk ~ uni$Alder)
summary(unimodel)
## 
## Call:
## lm(formula = uni$Blodtryk ~ uni$Alder)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.248 -1.673  1.127  1.602  6.814 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  75.6299    10.2884   7.351 5.58e-06 ***
## uni$Alder     1.5624     0.1817   8.597 1.01e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.41 on 13 degrees of freedom
## Multiple R-squared:  0.8504, Adjusted R-squared:  0.8389 
## F-statistic: 73.91 on 1 and 13 DF,  p-value: 1.008e-06

just as well as we might draw estimates for \(\alpha_{U},\,\beta_U\) out of R with

alphau <- unimodel$coefficients[[1]]
## [1] 75.62986
betau <- unimodel$coefficients[[2]]
## [1] 1.562384

d)

We may implement an aggregate linear regression model for blood pressure across the two groupings, with each group responsible for its own set of parameters. Note however that the naive model <- lm(EHopg11dot3$Blodtryk ~ jour$Alder+uni$Alder) won’t work, as we would be asking to model the \(28\) datapoints of the dataset based on \(13\) and \(15\) datapoints respectively. Instead, note that what we desire is of the format \[ EX_i=1_J(i)\mathopen{}\left({\alpha_J+\beta_Jt_i}\right)\mathclose{}+1_U(i)\mathopen{}\left({\alpha_U+\beta_Ut_i}\right)\mathclose{}. \] As such we may implement the indicator functions \(1_J,1_U\) in the estimation “via the data,” which we will do by creating the matrix

A <-matrix(c(rep(1,13),rep(0,15), jour$Alder,rep(0,15),rep(0,13),rep(1,15),rep(0,13),uni$Alder),nrow=28)
A
##       [,1] [,2] [,3] [,4]
##  [1,]    1   68    0    0
##  [2,]    1   62    0    0
##  [3,]    1   49    0    0
##  [4,]    1   62    0    0
##  [5,]    1   34    0    0
##  [6,]    1   66    0    0
##  [7,]    1   45    0    0
##  [8,]    1   60    0    0
##  [9,]    1   57    0    0
## [10,]    1   57    0    0
## [11,]    1   63    0    0
## [12,]    1   56    0    0
## [13,]    1   57    0    0
## [14,]    0    0    1   55
## [15,]    0    0    1   49
## [16,]    0    0    1   60
## [17,]    0    0    1   54
## [18,]    0    0    1   58
## [19,]    0    0    1   51
## [20,]    0    0    1   43
## [21,]    0    0    1   52
## [22,]    0    0    1   53
## [23,]    0    0    1   61
## [24,]    0    0    1   61
## [25,]    0    0    1   57
## [26,]    0    0    1   57
## [27,]    0    0    1   70
## [28,]    0    0    1   63

We will thus be able to do;

modelgroups <- lm(EHopg11dot3$Blodtryk ~ A-1)
summary(modelgroups)
## 
## Call:
## lm(formula = EHopg11dot3$Blodtryk ~ A - 1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.644 -3.019  1.201  1.785  7.219 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## A1  84.9952     7.7426  10.978 7.71e-11 ***
## A2   1.5296     0.1351  11.322 4.12e-11 ***
## A3  75.6299    10.1296   7.466 1.05e-07 ***
## A4   1.5624     0.1789   8.732 6.47e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.342 on 24 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9993 
## F-statistic: 1.045e+04 on 4 and 24 DF,  p-value: < 2.2e-16
We may recognize the already calculated estimates in the summary.
Pulling estimates from the model

The estimates may be pulled from the model in R via

modelgroups$coefficients[[1]]
modelgroups$coefficients[[2]]
modelgroups$coefficients[[3]]
modelgroups$coefficients[[4]]

e)

We may then finally do a full linear regression model based on the entire dataset, with no splitting into groups;

model <- lm(EHopg11dot3$Blodtryk~EHopg11dot3$Alder)

The estimates may once again be collected via

alpha <- model$coefficients[[1]]
alpha
## [1] 79.6603
beta <- model$coefficients[[2]]
beta
## [1] 1.552729

f)

EH 12.3

READ ’AN INTRODUCTION TO GENERAL LINEAR MODELS CHAPTER 6.4

As there is no data to read in, we may write it ourselves;

(data <- tibble(
   koag = c(62,60,63,59, 63,67,71,64,65,66, 68,66,71,67,68,68, 56,62,60,61,63,64,63,59),
   Diaet = factor(c(rep("A",4),rep("B", 6), rep("C",6), rep("D",8)), levels = c("A","B", "C", "D"))))
## # A tibble: 24 x 2
##     koag Diaet
##    <dbl> <fct>
##  1    62 A    
##  2    60 A    
##  3    63 A    
##  4    59 A    
##  5    63 B    
##  6    67 B    
##  7    71 B    
##  8    64 B    
##  9    65 B    
## 10    66 B    
## # ... with 14 more rows

a)

The linear model for a factor trial of the data, with the Diaet factor \(F:\mathopen{}\left\{{1,\ldots,24}\right\}\mathclose{}\rightarrow\mathopen{}\left\{{A,B,C,D}\right\}\mathclose{},\) is the following:

For \(\mathopen{}\left({X_i}\right)\mathclose{}_{i\in I:=\mathopen{}\left\{{1,\ldots,24}\right\}\mathclose{}}\), with \(X_i\sim\mathcal{N}\mathopen{}\left({{\xi_i},{\sigma^2}}\right)\mathclose{},\) we assume \(\xi\equiv\mathopen{}\left({\xi_i}\right)\mathclose{}_{i\in I}\in L_F,\) i.e. that \(f(i)=f(i')\Rightarrow\xi_i = \xi_{i'}.\)

b)

We may start off by doing a visual assessment of the variance and normality assumptions via a standardized residual plot and a QQ plot, as was done in HS16.

Stdresplot <- function(model, main = paste("(Estimate, Std. Res.)-plot of", deparse(substitute(model))), ylab ="Standardized residuals", ...) {
 fit <- fitted(model)
 rst <- rstandard(model)
 qplot(fit, rst, main = main, ylab = ylab, ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) )+geom_hline(yintercept = 0) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
}
QQplotdraw <- function(model, main = paste("Normal QQ-plot of", deparse(substitute(model))), xlab = "Theoretical Quantiles", ylab ="Sample Quantiles", ...) {
   rst <- rstandard(model)
   #main, xlab and ylab are not ggplot() arguments, so we pass them through labs()
   ggplot(data = eval(getCall(model)$data)) + geom_qq() + geom_qq_line() + aes(sample = rst) + labs(title = main, x = xlab, y = ylab)
}
StdresQQPlot <- function(model,...) {
   p1 <- Stdresplot(model,...)
   p2 <- QQplotdraw(model,...)
   library(gridExtra)
   grid.arrange(p1,p2, ncol = 2)
}

lmD <- lm(koag ~ Diaet - 1, data = data)
summary(lmD)
## 
## Call:
## lm(formula = koag ~ Diaet - 1, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5.00  -1.25   0.00   1.25   5.00 
## 
## Coefficients:
##        Estimate Std. Error t value Pr(>|t|)    
## DiaetA  61.0000     1.1832   51.55   <2e-16 ***
## DiaetB  66.0000     0.9661   68.32   <2e-16 ***
## DiaetC  68.0000     0.9661   70.39   <2e-16 ***
## DiaetD  61.0000     0.8367   72.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.366 on 20 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9986 
## F-statistic:  4399 on 4 and 20 DF,  p-value: < 2.2e-16
StdresQQPlot(lmD)

The std. res. plot gives no indication that the variance assumptions are off, given the limited amount of data. See the likes of HS2 for an in-depth description of what to look for. Looking at the QQ-plot, we see very little that would suggest that the normality assumption is off.

Having established the framework, we can state that the estimated means of the groups \(A,B,C,D\) are \(61,66,68,61.\) Note also that by (12.7), the estimate for the mean of each group is, according to the model, normally distributed around the true group mean, with variance equal to the common variance \(\sigma^2\) divided by the number of observations in the particular group. As such the variances can be estimated by

(sig2 <- (summary(lmD)$sigma)^2)
## [1] 5.6
sig2/(count(data, Diaet)[[2]])
## [1] 1.4000000 0.9333333 0.9333333 0.7000000

c)

We already pulled out the central variance estimate of the model in the previous subproblem. By EH page 501, we see that the central variance estimate will, according to the model, be \(\chi^2\) distributed with \(|I|-|F| \overset{\cdot}{=}24-4=20\) degrees of freedom, and scale parameter \(\frac{\sigma^2}{|I|-|F|},\) once again with \(\sigma^2\) being the true variance.
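As a hedged illustration of this distributional result (a sketch, not part of the original solution), a \(95\%\) confidence interval for \(\sigma^2\) may be obtained by inverting the \(\chi^2\) quantiles, using the estimate sig2 = 5.6 on 20 degrees of freedom from above:

```r
# Sketch: 95% CI for sigma^2, using that df * sig2 / sigma^2 ~ chi^2_df
df <- 20
sig2 <- 5.6
ci <- df * sig2 / qchisq(c(0.975, 0.025), df)
ci  # lower and upper endpoint for sigma^2
```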

d)

We may test whether there is any difference between the groups, i.e. test whether the mean value vector is in \(L_1.\)

lm_L1 <- lm(koag ~ 1, data)
(an <- anova(lm_L1, lmD))
## Analysis of Variance Table
## 
## Model 1: koag ~ 1
## Model 2: koag ~ Diaet - 1
##   Res.Df RSS Df Sum of Sq      F    Pr(>F)    
## 1     23 340                                  
## 2     20 112  3       228 13.571 4.658e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In particular we see an estimated \(F\) value of \(13.571\), and therefore a \(p\)-value of \(\approx 4.66\cdot 10^{-5},\) which we consider low enough to warrant rejecting the hypothesis that the groups are identically distributed.
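As a sanity check, the \(F\) statistic and \(p\)-value in the anova table above can be recomputed by hand from the residual sums of squares:

```r
# F = ((RSS_small - RSS_big)/(df_small - df_big)) / (RSS_big / df_big)
rss1 <- 340; df1 <- 23  # koag ~ 1
rss2 <- 112; df2 <- 20  # koag ~ Diaet - 1
Fstat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
pval <- pf(Fstat, df1 - df2, df2, lower.tail = FALSE)
c(Fstat, pval)
```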

e)

We may use the command

confint(lmD)
##           2.5 %   97.5 %
## DiaetA 58.53185 63.46815
## DiaetB 63.98477 68.01523
## DiaetC 65.98477 70.01523
## DiaetD 59.25476 62.74524

to see that a \(95\%\) confidence interval for blood clotting for patients in diet group A is \(\mathopen{}\left({58.53185, 63.46815}\right)\mathclose{}.\)
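The interval for DiaetA can also be reconstructed by hand from the summary output above (estimate 61, standard error 1.1832, 20 residual degrees of freedom):

```r
# 95% CI: estimate +/- t-quantile * standard error
est <- 61
se <- 1.1832
ci <- est + c(-1, 1) * qt(0.975, df = 20) * se
ci
```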

f)

We might create a new data set with only the \(A\)-data, and do an analysis on this;

diA <- subset(data, Diaet == "A") #remember the double '=='
lm_A <- lm(koag~1, data = diA)
summary(lm_A)
## 
## Call:
## lm(formula = koag ~ 1, data = diA)
## 
## Residuals:
##  1  2  3  4 
##  1 -1  2 -2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.0000     0.9129   66.82 7.39e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.826 on 3 degrees of freedom
confint(lm_A)
##                2.5 %   97.5 %
## (Intercept) 58.09484 63.90516
sigma(lm_A)^2
## [1] 3.333333

g)

For the two confidence intervals for diet A we see that they are \(\mathopen{}\left({58.53185, 63.46815}\right)\mathclose{}\) and \(\mathopen{}\left({58.09484,63.90516}\right)\mathclose{}\) respectively, i.e. fairly similar.

We may do the same with diet D:

diD <- subset(data, Diaet == "D")
lm_D <- lm(koag~1, data = diD)
confint(lmD)
##           2.5 %   97.5 %
## DiaetA 58.53185 63.46815
## DiaetB 63.98477 68.01523
## DiaetC 65.98477 70.01523
## DiaetD 59.25476 62.74524
confint(lm_D)
##                2.5 %   97.5 %
## (Intercept) 58.81078 63.18922

EH 13.1

We may create the data as was done in EH problem 12.3:

(EH131 <- tibble(
   vaegt = c(196,269,248,130,149,195,148,147,159,470,172,292,214,349,253,338,345,266,360,224,312,409,479,292,443,386,381,388,349,458,470,306),
   vand = factor(c(rep("A",4),rep("B", 4), rep("C",4), rep("D",4)) %>% rep(.,2) %>% c(.), levels = c("A","B","C","D")),
   goed = factor(c(rep("I", 16), rep("II", 16)))))
## # A tibble: 32 x 3
##    vaegt vand  goed 
##    <dbl> <fct> <fct>
##  1   196 A     I    
##  2   269 A     I    
##  3   248 A     I    
##  4   130 A     I    
##  5   149 B     I    
##  6   195 B     I    
##  7   148 B     I    
##  8   147 B     I    
##  9   159 C     I    
## 10   470 C     I    
## # ... with 22 more rows

a)

We might do a one-way analysis of variance for the product factor \(G \times V.\) Note the interpretation of a one-way analysis of variance with a product factor, as presented in EH pages 521-522: the model is that every combination of the involved labels has its own mean, so we may ask whether all treatments have the same distribution (same mean, as we assume that data \(X_i\sim\mathcal{N}\mathopen{}\left({{\xi_i},{\sigma^2}}\right)\mathclose{}\)).

We may thus start off by doing model control of the variance homogeneity and normality assumptions of the model, in order to provide the background for the anova framework. We thus turn to the standard visual model validation methods of standardized residual and QQ plots, as was also done in HS 16 and EH 12.3.

Note in particular that we implement the lm model of the product factor as a product in the following sense. Should we, or should we not, do -1?!!!

lmEH131 <- lm(vaegt ~ vand * goed -1, data = EH131)
model.matrix(lmEH131)
##    vandA vandB vandC vandD goedII vandB:goedII vandC:goedII vandD:goedII
## 1      1     0     0     0      0            0            0            0
## 2      1     0     0     0      0            0            0            0
## 3      1     0     0     0      0            0            0            0
## 4      1     0     0     0      0            0            0            0
## 5      0     1     0     0      0            0            0            0
## 6      0     1     0     0      0            0            0            0
## 7      0     1     0     0      0            0            0            0
## 8      0     1     0     0      0            0            0            0
## 9      0     0     1     0      0            0            0            0
## 10     0     0     1     0      0            0            0            0
## 11     0     0     1     0      0            0            0            0
## 12     0     0     1     0      0            0            0            0
## 13     0     0     0     1      0            0            0            0
## 14     0     0     0     1      0            0            0            0
## 15     0     0     0     1      0            0            0            0
## 16     0     0     0     1      0            0            0            0
## 17     1     0     0     0      1            0            0            0
## 18     1     0     0     0      1            0            0            0
## 19     1     0     0     0      1            0            0            0
## 20     1     0     0     0      1            0            0            0
## 21     0     1     0     0      1            1            0            0
## 22     0     1     0     0      1            1            0            0
## 23     0     1     0     0      1            1            0            0
## 24     0     1     0     0      1            1            0            0
## 25     0     0     1     0      1            0            1            0
## 26     0     0     1     0      1            0            1            0
## 27     0     0     1     0      1            0            1            0
## 28     0     0     1     0      1            0            1            0
## 29     0     0     0     1      1            0            0            1
## 30     0     0     0     1      1            0            0            1
## 31     0     0     0     1      1            0            0            1
## 32     0     0     0     1      1            0            0            1
## attr(,"assign")
## [1] 1 1 1 1 2 3 3 3
## attr(,"contrasts")
## attr(,"contrasts")$vand
## [1] "contr.treatment"
## 
## attr(,"contrasts")$goed
## [1] "contr.treatment"
lmEH131_1 <- lm(vaegt ~ 1, data = EH131)
anova(lmEH131_1, lmEH131)
## Analysis of Variance Table
## 
## Model 1: vaegt ~ 1
## Model 2: vaegt ~ vand * goed - 1
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     31 357537                                
## 2     24 146008  7    211529 4.9671 0.001392 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lmEH131)
## 
## Call:
## lm(formula = vaegt ~ vand * goed - 1, data = EH131)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -114.25  -50.31  -11.62   47.06  196.75 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## vandA          210.75      39.00   5.404 1.50e-05 ***
## vandB          159.75      39.00   4.096 0.000413 ***
## vandC          273.25      39.00   7.007 3.03e-07 ***
## vandD          288.50      39.00   7.398 1.23e-07 ***
## goedII          88.00      55.15   1.596 0.123671    
## vandB:goedII   125.25      78.00   1.606 0.121394    
## vandC:goedII    38.25      78.00   0.490 0.628306    
## vandD:goedII    19.25      78.00   0.247 0.807160    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78 on 24 degrees of freedom
## Multiple R-squared:  0.9549, Adjusted R-squared:  0.9398 
## F-statistic: 63.48 on 8 and 24 DF,  p-value: 2.854e-14
summary(lm(vaegt ~ vand * goed, data = EH131))
## 
## Call:
## lm(formula = vaegt ~ vand * goed, data = EH131)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -114.25  -50.31  -11.62   47.06  196.75 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    210.75      39.00   5.404  1.5e-05 ***
## vandB          -51.00      55.15  -0.925    0.364    
## vandC           62.50      55.15   1.133    0.268    
## vandD           77.75      55.15   1.410    0.171    
## goedII          88.00      55.15   1.596    0.124    
## vandB:goedII   125.25      78.00   1.606    0.121    
## vandC:goedII    38.25      78.00   0.490    0.628    
## vandD:goedII    19.25      78.00   0.247    0.807    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78 on 24 degrees of freedom
## Multiple R-squared:  0.5916, Adjusted R-squared:  0.4725 
## F-statistic: 4.967 on 7 and 24 DF,  p-value: 0.001392
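The two summaries above differ only in parametrization: with or without -1 the column space of the design matrix, and hence the fitted values, are identical. A minimal sketch of this on a hypothetical toy factor design (not the EH131 data):

```r
# With and without -1 only the coefficient labels change;
# the fitted subspace, and therefore the fitted values, agree.
set.seed(1)
d <- data.frame(y = rnorm(12), f = gl(2, 6), g = gl(3, 2, 12))
m_with    <- lm(y ~ f * g, data = d)      # intercept + treatment contrasts
m_without <- lm(y ~ f * g - 1, data = d)  # intercept dropped
all.equal(unname(fitted(m_with)), unname(fitted(m_without)))
```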

DIFFERENCE BETWEEN ONE WAY MULTIFACTOR ANALYSIS OF VARIANCE IN R, AND TWO WAY MULTIFACTOR ANALYSIS OF VARIANCE IN R

b)

c)

EH 13.4

a)

The linear model based on the product factor can be described by the following

For \(\mathopen{}\left({X_i}\right)\mathclose{}_{i\in I:=\mathopen{}\left\{{1,\ldots,36}\right\}\mathclose{}}\) independent, with \(X_i\sim\mathcal{N}\mathopen{}\left({{\xi_i},{\sigma^2}}\right)\mathclose{},\) we assume \(\xi\equiv\mathopen{}\left({\xi_i}\right)\mathclose{}_{i\in I}\in L_{Bl\times Br},\) i.e. the hypothesis is \[H_{Br\times Bl}:\xi_i = \alpha_{Br\mathopen{}\left({i}\right)\mathclose{}} + \beta_{Bl\mathopen{}\left({i}\right)\mathclose{}} + \gamma_{Br\mathopen{}\left({i}\right)\mathclose{},Bl\mathopen{}\left({i}\right)\mathclose{}},\quad\forall i\in I,\] for vectors \(\underline{\alpha},\,\underline{\beta}\) and interaction terms \(\underline{\gamma}\).

We may load in the data;

opgave13_4 <- read_table2("opgave13_4.txt", 
    col_types = cols(Blander = col_factor(levels = c("1", 
        "2", "3")), Bryder = col_factor(levels = c("1", 
        "2", "3"))))

Note in this regard that one could mutate the data set so that the labels of Blander match the labels used in the problem statement.

We may implement the model in R with

model134 <- lm(Respons ~ Blander*Bryder, data = opgave13_4)
summary(model134)
## 
## Call:
## lm(formula = Respons ~ Blander * Bryder, data = opgave13_4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -655.0 -347.5    0.0  290.0 1210.0 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5340.0      262.0  20.379   <2e-16 ***
## Blander2           -295.0      370.6  -0.796    0.433    
## Blander3            335.0      370.6   0.904    0.374    
## Bryder2            -350.0      370.6  -0.944    0.353    
## Bryder3            -525.0      370.6  -1.417    0.168    
## Blander2:Bryder2    650.0      524.1   1.240    0.226    
## Blander3:Bryder2     90.0      524.1   0.172    0.865    
## Blander2:Bryder3     -5.0      524.1  -0.010    0.992    
## Blander3:Bryder3   -232.5      524.1  -0.444    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 524.1 on 27 degrees of freedom
## Multiple R-squared:  0.3542, Adjusted R-squared:  0.1628 
## F-statistic: 1.851 on 8 and 27 DF,  p-value: 0.1108

We may do the standard visual check of the model assumptions regarding variance, and normality:

Stdresplot <- function(model, main = paste("(Estimate, Std. Res.)-plot of", deparse(substitute(model))), ylab ="Standardized residuals", ...) {
 fit <- fitted(model)
 rst <- rstandard(model)
 qplot(fit, rst, main = main, ylab = ylab, ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) )+geom_hline(yintercept = 0) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
}
QQplotdraw <- function(model, main = paste("Normal QQ-plot of", deparse(substitute(model))), xlab = "Theoretical Quantiles", ylab ="Sample Quantiles", ...) {
   rst <- rstandard(model)
   #main, xlab and ylab are not ggplot() arguments, so we pass them through labs()
   ggplot(data = eval(getCall(model)$data)) + geom_qq() + geom_qq_line() + aes(sample = rst) + labs(title = main, x = xlab, y = ylab)
}
StdresQQPlot <- function(model,...) {
   p1 <- Stdresplot(model,...)
   p2 <- QQplotdraw(model,...)
   library(gridExtra)
   grid.arrange(p1,p2, ncol = 2)
}

StdresQQPlot(model134)

which for the most part (a couple of lower valued points seem to miss the trend in the QQplot) look reasonable.

We may also get the central variance estimate from the model with

sigma(model134)^2
## [1] 274647.2

In the case of the one-way analysis of the product factor we might, as in EH 12.3, note that by EH p. 501, the central estimate for the variance will be \(\chi^2\) distributed, with \(\mathopen{}\left|{I}\right|\mathclose{}-\mathopen{}\left|{Br\times Bl}\right|\mathclose{}\overset{\cdot}{=}36-3\cdot 3 = 27\) degrees of freedom, and scale parameter \(\frac{\sigma^2}{\mathopen{}\left|{I}\right|\mathclose{}-\mathopen{}\left|{Br\times Bl}\right|\mathclose{}}\overset{\cdot}{=}\frac{\sigma^2}{27}\), such that \(\frac{27}{\sigma^2}\tilde{\sigma}^2\sim\chi^2_{\mathopen{}\left|{I}\right|\mathclose{}-\mathopen{}\left|{Br\times Bl}\right|\mathclose{}}.\)

b)

The additive hypothesis, i.e. that the effects of the different factors do not interact, is the following:
For data being realizations of independent \(\mathopen{}\left({X_i}\right)\mathclose{}_{i\in I}\) with \(X_i\sim\mathcal{N}\mathopen{}\left({{\xi_i},{\sigma^2}}\right)\mathclose{},\) the additive hypothesis with regard to the product factor \(Br\times Bl\) is that \[H_{Br + Bl}:\xi_i = \alpha_{Br\mathopen{}\left({i}\right)\mathclose{}} + \beta_{Bl\mathopen{}\left({i}\right)\mathclose{}},\quad\forall i\in I,\] for vectors \(\underline{\alpha},\,\underline{\beta}.\)

The canonical statistic for a test of the additive model against the full model is the likelihood ratio test statistic, which may be found on page 111 of BMS.

Making the additive model in R:

add <- lm(Respons ~ 0+Blander + Bryder, data = opgave13_4)
summary(add)
## 
## Call:
## lm(formula = Respons ~ 0 + Blander + Bryder, data = opgave13_4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -840.83 -291.67  -20.83  242.29 1099.17 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## Blander1   5284.2      190.3  27.774  < 2e-16 ***
## Blander2   5204.2      190.3  27.353  < 2e-16 ***
## Blander3   5571.7      190.3  29.285  < 2e-16 ***
## Bryder2    -103.3      208.4  -0.496  0.62353    
## Bryder3    -604.2      208.4  -2.899  0.00682 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 510.5 on 31 degrees of freedom
## Multiple R-squared:  0.9915, Adjusted R-squared:  0.9902 
## F-statistic: 726.1 on 5 and 31 DF,  p-value: < 2.2e-16
model.matrix(add)
##    Blander1 Blander2 Blander3 Bryder2 Bryder3
## 1         1        0        0       0       0
## 2         1        0        0       0       0
## 3         1        0        0       0       0
## 4         1        0        0       0       0
## 5         1        0        0       1       0
## 6         1        0        0       1       0
## 7         1        0        0       1       0
## 8         1        0        0       1       0
## 9         1        0        0       0       1
## 10        1        0        0       0       1
## 11        1        0        0       0       1
## 12        1        0        0       0       1
## 13        0        1        0       0       0
## 14        0        1        0       0       0
## 15        0        1        0       0       0
## 16        0        1        0       0       0
## 17        0        1        0       1       0
## 18        0        1        0       1       0
## 19        0        1        0       1       0
## 20        0        1        0       1       0
## 21        0        1        0       0       1
## 22        0        1        0       0       1
## 23        0        1        0       0       1
## 24        0        1        0       0       1
## 25        0        0        1       0       0
## 26        0        0        1       0       0
## 27        0        0        1       0       0
## 28        0        0        1       0       0
## 29        0        0        1       1       0
## 30        0        0        1       1       0
## 31        0        0        1       1       0
## 32        0        0        1       1       0
## 33        0        0        1       0       1
## 34        0        0        1       0       1
## 35        0        0        1       0       1
## 36        0        0        1       0       1
## attr(,"assign")
## [1] 1 1 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$Blander
## [1] "contr.treatment"
## 
## attr(,"contrasts")$Bryder
## [1] "contr.treatment"
anova(add, model134)
## Analysis of Variance Table
## 
## Model 1: Respons ~ 0 + Blander + Bryder
## Model 2: Respons ~ Blander * Bryder
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     31 8079308                           
## 2     27 7415475  4    663833 0.6043 0.6629

Where does Bryder1 go? Note that a \(p\)-value of \(0.66\) is sufficiently high that we cannot reject the hypothesis of additivity (we cannot reject the hypothesis that there is no interaction).
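As a sanity check, the \(F\) statistic of this additive-versus-full test can be recomputed by hand from the residual sums of squares in the anova table:

```r
# Recompute the F test of additive vs. full model from the RSS values
rss_add <- 8079308; df_add <- 31   # Respons ~ 0 + Blander + Bryder
rss_full <- 7415475; df_full <- 27 # Respons ~ Blander * Bryder
Fstat <- ((rss_add - rss_full) / (df_add - df_full)) / (rss_full / df_full)
pval <- pf(Fstat, df_add - df_full, df_full, lower.tail = FALSE)
c(Fstat, pval)
```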

We may visualise this with

ggplot(opgave13_4, aes(x = Blander, y = Respons, color = Bryder, group = Bryder)) +
    geom_point() +
    geom_line(stat = "summary", fun = "mean", size = 1) +
    xlab("Blander") +
    ylab("Knæktryk") +
    scale_color_discrete("Bryder")

This doesn’t look terribly good for the green line; how could we proceed with the hypothesis of additivity?!?!

c)

If we were to continue with the hypothesis of additivity, we would be able to analyse whether the Blander and Bryder factors have an effect by themselves. Doing this in R:

lmBl <- lm(Respons ~ Blander-1, opgave13_4)
summary(lmBl)
## 
## Call:
## lm(formula = Respons ~ Blander - 1, data = opgave13_4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -888.33 -458.33    7.92  319.79 1231.67 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## Blander1   5048.3      163.5   30.88   <2e-16 ***
## Blander2   4968.3      163.5   30.39   <2e-16 ***
## Blander3   5335.8      163.5   32.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 566.4 on 33 degrees of freedom
## Multiple R-squared:  0.9889, Adjusted R-squared:  0.9879 
## F-statistic: 980.7 on 3 and 33 DF,  p-value: < 2.2e-16
lmBr <- lm(Respons ~ Bryder-1, opgave13_4)
summary(lmBr)
## 
## Call:
## lm(formula = Respons ~ Bryder - 1, data = opgave13_4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -933.33 -310.00  -31.25  337.50  950.00 
## 
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)    
## Bryder1   5353.3      150.6   35.56   <2e-16 ***
## Bryder2   5250.0      150.6   34.87   <2e-16 ***
## Bryder3   4749.2      150.6   31.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 521.5 on 33 degrees of freedom
## Multiple R-squared:  0.9906, Adjusted R-squared:  0.9897 
## F-statistic:  1158 on 3 and 33 DF,  p-value: < 2.2e-16

As such we might do individual anovas between these models and the full model.

anova(lmBl, model134)
## Analysis of Variance Table
## 
## Model 1: Respons ~ Blander - 1
## Model 2: Respons ~ Blander * Bryder
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1     33 10585425                           
## 2     27  7415475  6   3169950 1.9236 0.1132
anova(lmBr, model134)
## Analysis of Variance Table
## 
## Model 1: Respons ~ Bryder - 1
## Model 2: Respons ~ Blander * Bryder
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     33 8975758                           
## 2     27 7415475  6   1560283 0.9468 0.4787

We see that the reduction to the Blander-only model yields a noticeably lower \(p\)-value (\(0.1132\)) than the reduction to the Bryder-only model (\(0.4787\)). At the usual \(5\%\) level neither reduction can be formally rejected against the full model, but the hypothesis that Blander alone accounts for the attained strength of the tiles is considerably more dubious than the corresponding hypothesis for Bryder.

anova(model134)
## Analysis of Variance Table
## 
## Response: Respons
##                Df  Sum Sq Mean Sq F value  Pr(>F)  
## Blander         2  896450  448225  1.6320 0.21424  
## Bryder          2 2506117 1253058  4.5624 0.01963 *
## Blander:Bryder  4  663833  165958  0.6043 0.66290  
## Residuals      27 7415475  274647                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of the above anova: anova applied to a single model produces a sequential table, where each row tests the added explanatory power of that term given the terms above it. In particular the Blander:Bryder row (\(p = 0.663\)) reproduces the additive-versus-full test from above, while the Bryder row (\(p = 0.0196\)) indicates a genuine Bryder effect. Note also that doing -1 or not only changes the parametrization, not the fitted subspaces, so the anova comparisons are unaffected.


NRH Problems

NRH 1.1

NRH 1.2

NRH 1.3

NRH 1.4


BMS Problems

BMS 1.1

BMS 1.2

BMS 1.3

BMS 1.4

BMS 1.5

BMS 1.6

BMS 1.7

BMS 1.8

BMS 3.4

f)

We may start off by defining various starting parameters;

n <- 10^3 #outcomes
mu <- 4
delta <- 3
reprep <- 10^4 #number of repeated experiments
dat <- matrix(NA,n,reprep)

Then we run the simulation of reprep = 10^4 datasets of n = 1000 \(U(\mu-\delta,\mu+\delta)\) data points.

for (i in 1:reprep) {
  dat[,i] <- runif(n, min=mu-delta, max = mu+delta) 
}

datMax <- apply(dat, 2, max) #apply the max function to all columns (2) of dat
datMin <- apply(dat, 2, min)

#Definitions in problem b)
muhat <- (datMin+datMax)/2 
deltahat <- (datMax-datMin)/2

Note also that

mean(muhat)
## [1] 3.999994
mean(deltahat)
## [1] 2.99399
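Note that mean(deltahat) falls slightly below \(\delta=3.\) This matches the exact bias of the range-based estimator: for \(n\) i.i.d. uniforms, \(E\hat{\delta} = \delta\frac{n-1}{n+1}\) (a standard order-statistics fact, stated here as a check rather than as part of the original solution):

```r
# Exact expectation of deltahat = (max - min)/2 for n iid U(mu-delta, mu+delta):
# E[deltahat] = delta * (n - 1) / (n + 1), slightly below delta itself
n <- 10^3
delta <- 3
delta * (n - 1) / (n + 1)
```

which agrees closely with the simulated mean(deltahat) above.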

We may do histograms of the data

p_1 <- qplot(muhat, geom = "histogram")
p_2 <- qplot(deltahat, geom = "histogram")

grid.arrange(p_1,p_2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To verify c);

mucheck <- apply(dat, 2, median)
mean(mucheck)
## [1] 4.000596
qplot(mucheck)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

BMS 7.1

BMS 7.1a)

Note that the Poisson distribution, like the data, has support on \(\mathbb{N}_0,\) so since the data points appear to fall off in a roughly exponentially decaying fashion, a Poisson model could very well be appropriate.

BMS 7.1b)

Under the assumption that the data are Poisson distributed, note that the Poisson distribution is uniquely determined by its mean. Our best estimate of the mean of the data, assumed Poisson distributed, will therefore be the sample average. We have \(150\) experiments and get

library(pander)
## Warning: package 'pander' was built under R version 4.0.5
n <- 150
matrixx <- matrix(c(0:4,98,40,8,3,1), nrow=2, byrow=T)
matrixx[2,]
## [1] 98 40  8  3  1
data <- as.data.frame(matrixx)
rownames(data) <- c("Number of colonies per dish", "Number of Petri dishes")
data
##                             V1 V2 V3 V4 V5
## Number of colonies per dish  0  1  2  3  4
## Number of Petri dishes      98 40  8  3  1
pander(matrixx)
0 1 2 3 4
98 40 8 3 1
(meandata <- (0*98+1*40+2*8+3*3+4*1)/150)
## [1] 0.46
lambda <- meandata

BMS 7.1c)

We would like to wrap our Poisson model in a multinomial family, so that we may apply well-known results to it and thereby analyze whether the data can be said to come from the proposed Poisson model.

Note that we have five categories of data points: 0 colonies per Petri dish, 1 colony per Petri dish, …, 4 colonies per Petri dish, so we could obtain a multinomial model with \(k=4.\)

As mentioned on page 130, we would however like at least five data points in each category.

We thus see that we must reduce to \(k=2\) with the three categories 0 colonies per Petri dish, 1 colony per Petri dish, and \(\geq 2\) colonies per Petri dish, giving the table;

matrixx_2 <- matrix(c(98,40,14), nrow=1, byrow=T)
colnames(matrixx_2) <- c("0","1",">=2")
rownames(matrixx_2) <- c("Number of Petri dishes")
pander(matrixx_2)
  0 1 >=2
Number of Petri dishes 98 40 14

For \(k=2\) we may, for instance, parametrize the model by the probabilities, and we choose to parametrize by \((\pi_0,\pi_1)\in\mathopen{}\left({0,1}\right)\mathclose{}^2,\) where \(\pi_0:=P(X=0)\) and \(\pi_1:=P(X=1).\)

Under the Poisson model for \(X\) we may thus declare \(\pi_0:=P(X=0)=\frac{\lambda^0}{0!}e^{-\lambda},\,\pi_1:=P(X=1)=\frac{\lambda^1}{1!}e^{-\lambda}.\)
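A hedged numerical sketch of where this parametrization is heading (an illustration only; the formal multinomial test follows BMS): the expected counts under Poisson(\(\hat\lambda = 0.46\)) for the collapsed categories, and the corresponding Pearson \(\chi^2\) statistic.

```r
# Expected counts under Poisson(0.46) for the categories {0, 1, >=2},
# compared to the observed counts, and the Pearson chi-square statistic
lambda <- 0.46
n <- 150
probs <- c(dpois(0, lambda), dpois(1, lambda), 1 - ppois(1, lambda))
expected <- n * probs
observed <- c(98, 40, 14)
X2 <- sum((observed - expected)^2 / expected)
```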

Exam problems

August 2020

Problem 2

We will be reading in data as

restaurant <- read_csv("restaurant.txt", 
    col_types = cols(D = col_factor(levels = c("m", 
        "ti", "o", "to", "f", "l", "s")), 
        B = col_factor(levels = c("br", "fr", 
            "mi")), T = col_factor(levels = c("fa", 
            "ta"))))
table(restaurant$B,restaurant$T, restaurant$D)
## , ,  = m
## 
##     
##      fa ta
##   br  0  0
##   fr  4  4
##   mi  0  0
## 
## , ,  = ti
## 
##     
##      fa ta
##   br  0  0
##   fr  4  4
##   mi  0  0
## 
## , ,  = o
## 
##     
##      fa ta
##   br  0  0
##   fr  4  4
##   mi  0  0
## 
## , ,  = to
## 
##     
##      fa ta
##   br  0  0
##   fr  4  4
##   mi  0  0
## 
## , ,  = f
## 
##     
##      fa ta
##   br  0  0
##   fr  4  4
##   mi  0  0
## 
## , ,  = l
## 
##     
##      fa ta
##   br  4  4
##   fr  0  0
##   mi  4  4
## 
## , ,  = s
## 
##     
##      fa ta
##   br  4  4
##   fr  0  0
##   mi  4  4
count(restaurant, B, T)
## # A tibble: 6 x 3
##   B     T         n
##   <fct> <fct> <int>
## 1 br    fa        8
## 2 br    ta        8
## 3 fr    fa       20
## 4 fr    ta       20
## 5 mi    fa        8
## 6 mi    ta        8
count(restaurant, B, D)
## # A tibble: 9 x 3
##   B     D         n
##   <fct> <fct> <int>
## 1 br    l         8
## 2 br    s         8
## 3 fr    m         8
## 4 fr    ti        8
## 5 fr    o         8
## 6 fr    to        8
## 7 fr    f         8
## 8 mi    l         8
## 9 mi    s         8
count(restaurant, D, T,B)
## # A tibble: 18 x 4
##    D     T     B         n
##    <fct> <fct> <fct> <int>
##  1 m     fa    fr        4
##  2 m     ta    fr        4
##  3 ti    fa    fr        4
##  4 ti    ta    fr        4
##  5 o     fa    fr        4
##  6 o     ta    fr        4
##  7 to    fa    fr        4
##  8 to    ta    fr        4
##  9 f     fa    fr        4
## 10 f     ta    fr        4
## 11 l     fa    br        4
## 12 l     fa    mi        4
## 13 l     ta    br        4
## 14 l     ta    mi        4
## 15 s     fa    br        4
## 16 s     fa    mi        4
## 17 s     ta    br        4
## 18 s     ta    mi        4
table((restaurant$B:restaurant$T), restaurant$D) # Table of (B x T) against D
##        
##         m ti o to f l s
##   br:fa 0  0 0  0 0 4 4
##   br:ta 0  0 0  0 0 4 4
##   fr:fa 4  4 4  4 4 0 0
##   fr:ta 4  4 4  4 4 0 0
##   mi:fa 0  0 0  0 0 4 4
##   mi:ta 0  0 0  0 0 4 4
table(interaction(restaurant$B, restaurant$T, restaurant$D))
## 
##  br.fa.m  fr.fa.m  mi.fa.m  br.ta.m  fr.ta.m  mi.ta.m br.fa.ti fr.fa.ti 
##        0        4        0        0        4        0        0        4 
## mi.fa.ti br.ta.ti fr.ta.ti mi.ta.ti  br.fa.o  fr.fa.o  mi.fa.o  br.ta.o 
##        0        0        4        0        0        4        0        0 
##  fr.ta.o  mi.ta.o br.fa.to fr.fa.to mi.fa.to br.ta.to fr.ta.to mi.ta.to 
##        4        0        0        4        0        0        4        0 
##  br.fa.f  fr.fa.f  mi.fa.f  br.ta.f  fr.ta.f  mi.ta.f  br.fa.l  fr.fa.l 
##        0        4        0        0        4        0        4        0 
##  mi.fa.l  br.ta.l  fr.ta.l  mi.ta.l  br.fa.s  fr.fa.s  mi.fa.s  br.ta.s 
##        4        4        0        4        4        0        4        4 
##  fr.ta.s  mi.ta.s 
##        0        4
tablesum <- function(x) {
   newrow <- apply(x,2,sum) #columnsums
   newx <- rbind(x,newrow)
   newcol <- apply(newx,1,sum) #rowsums
   newnewx <- cbind(newx,newcol)
   colnames(newnewx)[ncol(newnewx)] <- c("RowSum")
   rownames(newnewx)[nrow(newnewx)] <- c("ColSum")
   newnewx
}

SamhBalanceEquation <- function(x) { 
   # Cannot be used with the tablesum function.
   # !!! Requires a connected design (sammenhængende design). !!!
   # Checks the balance equation x[i,j] = rowsum_i * colsum_j / total in
   # every cell, using a tolerance rather than exact == on floats.
   sx <- sum(x)
   rs <- apply(x, 1, sum)
   cs <- apply(x, 2, sum)
   all(abs(x - outer(rs, cs)/sx) < 1e-8)
}
attach(restaurant)
## The following object is masked _by_ .GlobalEnv:
## 
##     X
## The following object is masked from package:base:
## 
##     T
SamhBalanceEquation(table(restaurant$B, restaurant$T))
## [1] TRUE
SamhBalanceEquation(table(interaction(restaurant$B, restaurant$T), restaurant$D))
## [1] FALSE
# TODO: expand to also test the balance equation for disconnected ('ikke sammenhængende') designs.
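The same balance-equation check can be sketched language-agnostically; here is a small Python version (a sketch, with the B × T counts read off the table() output above) testing \(n_{ij} = n_{i\cdot}n_{\cdot j}/n\) in every cell.

```python
def is_balanced(x, tol=1e-9):
    """True if every cell satisfies n_ij = rowsum_i * colsum_j / total."""
    n = sum(sum(row) for row in x)
    row_sums = [sum(row) for row in x]
    col_sums = [sum(col) for col in zip(*x)]
    return all(abs(x[i][j] - row_sums[i] * col_sums[j] / n) < tol
               for i in range(len(x)) for j in range(len(x[0])))

bt_table = [[8, 8], [20, 20], [8, 8]]  # rows br, fr, mi; columns fa, ta
# is_balanced(bt_table) -> True, matching SamhBalanceEquation above
```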

2.3

We may fit the models

m0 <- lm(X~D+B*T, data=restaurant)
m1 <- lm(log(X)~D+B*T, data=restaurant)
StdresQQPlot(m0)

StdresQQPlot(m1)

2.4

Note that this

Stat 1: June 2017

Opgave 1

Opgave 1.4

Look at page 67; note that we have the simple hypothesis \(\theta=4,\) such that \(d\) (the dimension of the hypothesis) \(=0,\) and \(m\) (the dimension of the parameter space in the general model) \(=1.\)

t <- c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0)
x <- c(0.88,0.86,0.62,0.79,0.66,0.54,0.88,0.87,0.73,0.79)
n <- length(x)
(y <- -t*log(x))
##  [1] 0.01278334 0.03016458 0.14341074 0.09428893 0.20775772 0.36971168
##  [7] 0.08948336 0.11140965 0.28323967 0.23572233
(S <- sum(y))
## [1] 1.577972
(ymean <- S/n)
## [1] 0.1577972
theta <- 4
LRfunc <- function(theta) {
   # -2*log(Q) for the simple hypothesis, as a function of theta
   2*n*(-log(theta)+theta*ymean-log(ymean)-1)
}
LR <- LRfunc(theta=4)
pchisq(LR, df = 1) # lower tail; the p-value is 1 minus this:
## [1] 0.8234918
1-pchisq(LR, df = 1)
## [1] 0.1765082

So we cannot reject the hypothesis that \(\theta=4.\)
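The computation can be cross-checked outside R; the sketch below (Python, standard library only) recomputes \(-2\log Q\) and uses that the upper tail of \(\chi^2_1\) at \(q\) equals \(\operatorname{erfc}(\sqrt{q/2}).\)

```python
import math

t = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
x = [0.88, 0.86, 0.62, 0.79, 0.66, 0.54, 0.88, 0.87, 0.73, 0.79]
n = len(x)
ymean = sum(-ti * math.log(xi) for ti, xi in zip(t, x)) / n

def lr(theta):
    # -2 log Q for the simple hypothesis theta = theta0
    return 2 * n * (-math.log(theta) + theta * ymean - math.log(ymean) - 1)

p_value = math.erfc(math.sqrt(lr(4) / 2))  # upper chi^2_1 tail; approx 0.1765
```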

Problem 1.5

In Problem 1.2 we computed \(\hat\theta = -\frac{n}{\sum_{i=1}^{n}{t_i\log{x_i}}},\) so for the data we get a \(\hat\theta\) of size:

(hattheta <- -n/(sum(t*log(x))))
## [1] 6.337248

We can then find the \(95\%\) quantile of the \(\chi^2_{1}\) distribution, and find the appropriate confidence interval; see p. 98 of BMS:

(g025 <- qgamma(0.025, n, 1))
## [1] 4.795389
(g975 <- qgamma(0.975, n, 1))
## [1] 17.0848
c(g025/S, g975/S)
## [1]  3.038957 10.827064
qchisq(0.95, 1)
## [1] 3.841459
#uniroot(LRfunc, c(mean(LRfunc(4))/10,mean(LRfunc(4)))) # Does not work.
seqT <- seq(0.001,15, by=0.01)
vals <- LRfunc(seqT)
plot(seqT,vals)
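The gamma quantiles above can also be reproduced without any statistics library: for integer shape \(n\), the \(\Gamma(n,1)\) CDF is \(1-\sum_{k<n}e^{-q}q^k/k!\) (the Poisson relation), which a bisection can invert. A Python sketch, with the observed \(S\) taken from the output above:

```python
import math

def gamma_cdf(q, n):
    """CDF of Gamma(shape n, rate 1) for integer n, via the Poisson relation."""
    return 1.0 - sum(math.exp(-q) * q**k / math.factorial(k) for k in range(n))

def gamma_quantile(p, n, lo=0.0, hi=100.0):
    for _ in range(200):  # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if gamma_cdf(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

S = 1.577972  # observed sum of the y_i
ci = (gamma_quantile(0.025, 10) / S, gamma_quantile(0.975, 10) / S)
# ci matches the R interval (3.038957, 10.827064)
```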

Problem 1.6

We use the commands:

theta <- 5
sim <- function(n) {
   q <- c()
   for (i in 1:5000) {
      t <- (1:n)/n
      x <- rbeta(n, shape1=theta*t, shape2=1)
      q[i] <- -n/(sum(t*log(x)))
   }
   q
}
dat10 <- sim(10)
dat25 <- sim(25)
dat250 <- sim(250)
(mdat10 <- mean(dat10))
## [1] 5.602083
(mdat25 <- mean(dat25))
## [1] 5.216197
(mdat250 <- mean(dat250))
## [1] 5.017503
(sddat10 <- sd(dat10))
## [1] 2.01727
(sddat25 <- sd(dat25))
## [1] 1.079147
(sddat250 <- sd(dat250))
## [1] 0.3210583
seqT <- seq(0.001,15, by=0.01)
nsim <- function(n) dnorm(seqT, theta, sd=sqrt(theta^2/n))
par(mfrow=c(1,3))
hist(dat10, prob=T); lines(x=seqT, y=nsim(10))
hist(dat25, prob=T); lines(x=seqT, y=nsim(25))
hist(dat250, prob=T); lines(x=seqT, y=nsim(250))
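The simulation can be mirrored in Python without a beta sampler, since \(X\sim\mathrm{Beta}(\theta t,1)\) has CDF \(x^{\theta t}\) and can thus be drawn by inversion as \(U^{1/(\theta t)}.\) A sketch (fewer replications than above, for speed; exact replicate values will of course differ):

```python
import math
import random

def sim_theta_hat(n, theta=5, reps=2000, seed=1):
    """Simulate the MLE theta-hat = -n / sum(t_i * log x_i), reps times."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        s = 0.0
        for i in range(1, n + 1):
            ti = i / n
            s += ti * math.log(rng.random() ** (1 / (theta * ti)))  # Beta(theta*t_i, 1) draw
        out.append(-n / s)
    return out

est250 = sim_theta_hat(250)
# mean(est250) should sit near n*theta/(n-1) = 5.02, sd near theta/sqrt(n) = 0.316
```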

Problem 2

Stat2 June 18

Problem 3

library(readr) # provides read_table2
moral <- read_table2("moral.txt", 
    col_types = cols(koen = col_factor(levels = c("mand", 
        "kvinde")), rang = col_factor(levels = c("officer", 
        "menig"))))

3.1

3.2+3.3

Note that \(\dim(V_G),\,G\in\mathbb{G}_{+1}\equiv\mathbb{G}\cup\mathopen{}\left\{{1}\right\}\mathclose{},\) will be…

sum((moral$score)^2)
## [1] 288373.4
dL1 <- 1
dLK <- 2
dLR <- 2
dLKxR <- 4
dLI <- 225

dV1 <- dL1
(dVR <- dLR-dV1)
## [1] 1
(dVK <- dLK-dV1)
## [1] 1
(dVKxR <- dLKxR - dVK - dVR - dV1)
## [1] 1
(dLKpR <- dVK+dVR+dV1)
## [1] 3
P1 <- 251710
PK <- 252453
PR <- 254250
PKxR <- 255038
PI <- 288373 # = ||X||^2 (rounded)
Q1 <- P1
(QK <- PK-Q1)
## [1] 743
(QR <- PR-Q1)
## [1] 2540
(QKxR <- PKxR-QK-QR-Q1)
## [1] 45
(QI <- PI-QKxR-QK-QR-Q1)
## [1] 33335
(PKpR <- QK + QR + Q1)
## [1] 254993
(QI-PKxR)
## [1] -221703
(taeller <- (PKxR-PKpR)/(dLKxR-dLKpR))
## [1] 45
(naevner <- (PI-PKxR)/(dLI-dLKxR))
## [1] 150.8371
(Feren <- taeller/naevner)
## [1] 0.2983351
pf(Feren, df1 = dLKxR-dLKpR, df2 = dLI-dLKxR, lower.tail = F)
## [1] 0.585479
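The sum-of-squares bookkeeping above is plain arithmetic and easy to cross-check; a minimal Python sketch using the projection norms \(P_G\) reported in the text:

```python
# Squared projection norms, as reported above (rounded to integers)
P1, PK, PR, PKxR, PI = 251710, 252453, 254250, 255038, 288373

QK = PK - P1                       # 743
QR = PR - P1                       # 2540
QKxR = PKxR - QK - QR - P1         # 45
QI = PI - QKxR - QK - QR - P1      # 33335 = PI - PKxR

F = (QKxR / 1) / (QI / (225 - 4))  # numerator df 1, denominator df 221
# F is approx 0.2983, matching the R computation
```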

Alternatively:

m1 <- lm(score~koen+rang, data=moral)
anova(m1,m0)
## Warning in anova.lmlist(object, ...): models with response '"X"' removed because
## response differs from model 1
## Analysis of Variance Table
## 
## Response: score
##            Df Sum Sq Mean Sq F value   Pr(>F)    
## koen        1    743  743.09   4.942  0.02722 *  
## rang        1   2539 2539.33  16.888 5.58e-05 ***
## Residuals 222  33381  150.36                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4

table(moral$koen, moral$rang)
##         
##          officer menig
##   mand        60   120
##   kvinde      15    30
m0 <- lm(score~koen*rang, data=moral)
summary(m0)
## 
## Call:
## lm(formula = score ~ koen * rang, data = moral)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.423  -8.050  -0.861   8.267  32.709 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            36.973      1.586  23.318  < 2e-16 ***
## koenkvinde              6.127      3.545   1.728 0.085342 .  
## rangmenig              -6.651      1.942  -3.425 0.000732 ***
## koenkvinde:rangmenig   -2.376      4.342  -0.547 0.584791    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.28 on 221 degrees of freedom
## Multiple R-squared:  0.09076,    Adjusted R-squared:  0.07842 
## F-statistic: 7.353 on 3 and 221 DF,  p-value: 0.0001016

We may draw out the estimates:

(cof <- m0$coefficients)
##          (Intercept)           koenkvinde            rangmenig 
##            36.972667             6.127333            -6.651250 
## koenkvinde:rangmenig 
##            -2.376083
cof[2] <- cof[1]+cof[2]
cof[3] <- cof[1]+cof[3]
cof[4] <- cof[2]+cof[3]-cof[1]+cof[4] # cell mean = intercept + both main effects + interaction
names(cof) <- c("mand, officer", "kvinde, officer", "mand, menig", "kvinde, menig")
cof
##   mand, officer kvinde, officer     mand, menig   kvinde, menig 
##        36.97267        43.10000        30.32142        34.07267

We may also do a std. res + QQ-plot

StdresQQPlot(m0)

Looking at the QQ-plot, there seem to be no significant anomalies to speak of. As regards homogeneity of the variance, this also seems rather plausible when looking at the Std. Res. plot. There might seem to be slightly less variation in the performance of female officers than in the other groupings, though it is not far off, and it could also be attributed to the low number of female officers counted: 15.

3.5

By 3.3 we get a \(p\)-value of 0.585479 for the \(F\)-test of the hypothesis, so we cannot reject the hypothesis that no interaction effects are present.

3.6

Note from the summary of the no-interaction model ('ingen vekselvirkningsmodel') m1 that \(p\)-values are reported which, at the \(\alpha=0.05\) level, tell us whether we might remove any of the terms:

m1 <- lm(score~koen+rang, data=moral)
summary(m1)
## 
## Call:
## lm(formula = score ~ koen + rang, data = moral)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.056  -7.849  -0.813   8.421  32.867 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   37.289      1.474  25.303  < 2e-16 ***
## koenkvinde     4.543      2.044   2.223   0.0272 *  
## rangmenig     -7.126      1.734  -4.109 5.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.26 on 222 degrees of freedom
## Multiple R-squared:  0.08953,    Adjusted R-squared:  0.08133 
## F-statistic: 10.91 on 2 and 222 DF,  p-value: 3.01e-05

We see that the test for removing \(K\) gives Pr(>|t|) = 0.0272205 < 0.05. Had we restricted ourselves to an \(\alpha=0.01\) level, \(K\) would not have been significant at that level, but at the level \(\alpha=0.05,\) both koen and rang are significant.

Alternatively, we might look at

confint(m1, level=0.95)
##                   2.5 %    97.5 %
## (Intercept)  34.3851622 40.193793
## koenkvinde    0.5157167  8.570839
## rangmenig   -10.5439656 -3.708968

from which we see that \(0\) is not contained in any of the intervals, so both \(K\) and \(R\) are significant at the \(\alpha=0.05\) level.

3.7

We are under the hypothesis of no interaction effects between koen and rang; we thus have three parameters: the intercept (with reference group mand, officer) and two contrasts, one for koen and one for rang.

overv <- cbind(coef(m1),confint(m1, level=0.95))
colnames(overv)[1] <- "Estimate"
overv
##              Estimate       2.5 %    97.5 %
## (Intercept) 37.289478  34.3851622 40.193793
## koenkvinde   4.543278   0.5157167  8.570839
## rangmenig   -7.126467 -10.5439656 -3.708968

The model tells us that the estimated score for a male officer (the reference population, occupying the (Intercept)) is 37.2894778. Switching from this reference population to a female officer yields an estimated improvement in score of 4.5432778, totalling 41.8327556, while switching from the (Intercept) reference to a menig male yields an "improvement" (in fact a decrease) of -7.1264667, totalling 30.1630111. As a final point, the estimate for a menig female receives the combined change 4.5432778 - 7.1264667 = -2.5831889 relative to the reference group (remember that we assume no interaction effects), thus totalling 34.7062889.
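The additive bookkeeping in the paragraph above can be tallied explicitly; a small Python sketch using the m1 estimates:

```python
# Estimates from the m1 summary (no-interaction model)
intercept, koen_kvinde, rang_menig = 37.2894778, 4.5432778, -7.1264667

pred = {
    ("mand", "officer"): intercept,                             # reference group
    ("kvinde", "officer"): intercept + koen_kvinde,             # 41.8327556
    ("mand", "menig"): intercept + rang_menig,                  # 30.1630111
    ("kvinde", "menig"): intercept + koen_kvinde + rang_menig,  # 34.7062889
}
```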

As in 3.6, we may conclude that there is a significant difference at the \(\alpha=0.05\) level between the performance of males and females (the estimated difference being 4.5432778).

Exploring the significance, note how the lower confidence bound for koenkvinde crosses \(0\) right around level \(1-0.0272,\) matching the \(p\)-value from the summary:
confint(m1, level=1-0.0272)
##                    1.36 %   98.64 %
## (Intercept)  3.401283e+01 40.566123
## koenkvinde  -6.110427e-04  9.087167
## rangmenig   -1.098208e+01 -3.270849
confint(m1, level=1-0.02)
##                     1 %      99 %
## (Intercept)  33.8361123 40.742843
## koenkvinde   -0.2456785  9.332234
## rangmenig   -11.1900308 -3.062903
confint(m1, level=1-0.0273)
##                    1.36 %   98.64 %
## (Intercept)  34.014984266 40.563971
## koenkvinde    0.002372294  9.084183
## rangmenig   -10.979552739 -3.273381